Abstract

In the dynamic indexing problem, we must maintain a changing collection of text documents 
so that we can efficiently support insertions, deletions, and pattern matching queries. We are 
especially interested in developing efficient data structures that store and query the documents 
in compressed form. All previous compressed solutions to this problem rely on answering rank 
and select queries on a dynamic sequence of symbols. Because of the lower bound in [Fredman 
and Saks, 1989], answering rank queries presents a bottleneck in compressed dynamic indexing. 
In this paper we show how this lower bound can be circumvented using our new framework. 
We demonstrate that the gap between static and dynamic variants of the indexing problem can 
be almost closed. Our method is based on a novel framework for adding dynamism to static 
compressed data structures. Our framework also applies more generally to dynamizing other 
problems. We show, for example, how our framework can be applied to develop compressed 
representations of dynamic graphs and binary relations. 


1 Introduction 

Motivated by the preponderance of massive data sets (so-called “big data”), it is becoming increasingly
useful to store data in compressed form and moreover to manipulate and query the data while
in compressed form. For that reason, such compressed data structures have been developed in the
context of text indexing, graph representations, XML indexes, labeled trees, and many other applications.
In this paper we describe a general framework to convert known static compressed data
structures into dynamic compressed data structures. We show how this framework can be used 
to obtain significant improvements for two important dynamic problems: maintaining a dynamic 
graph and storing a dynamic collection of documents. We expect that our general framework will 
find further applications. 

In the indexing problem, we keep a text or a collection of texts in a data structure, so that, 
given a query pattern, we can list all occurrences of the query pattern in the texts. This problem is 
one of the most fundamental in the area of string algorithms. Data structures that use O(n log n)
bits of space can answer pattern matching queries in optimal time O(|P| + occ), where |P| denotes
the length of the query pattern P and occ is the number of occurrences of P. Because of the large
volumes of data stored in text databases and document collections, we are especially interested in
data structures that store the text or texts in compressed form and at the same time can answer
pattern matching queries efficiently. The compressed indexing problem was extensively studied in the
static scenario, and significant progress has been achieved during the last two decades; we refer to
the survey [33] for an overview of previous results in this area.

In the dynamic indexing problem, also known as the library management problem, we maintain 
a collection of documents (texts) in a data structure under insertions and deletions of texts. It 
is not difficult to keep a dynamic collection of texts in O(n) words (i.e., O(n log n) bits) and
support pattern matching queries at the same time. For instance, we can maintain suffixes of
all texts in a suffix tree; when a text is added or deleted, we add all suffixes of the new
text to the suffix tree (respectively, remove all suffixes of the deleted text from the suffix tree).
We refer the reader to Section A.2 for a more detailed description of the O(n log n)-bit solution. The
problem of keeping a dynamic document collection in compressed form is, however, more challenging.
Compressed data structures for the library management problem were considered in a number of
papers, including [9], [30], [28], [29], [19], and [35]. In spite of previous work, the query times
of previously described dynamic data structures significantly exceed the query times of the best 
static indexes. In this paper we show that the gap between the static and the dynamic variants of 
the compressed indexing problem can be closed or almost closed. Furthermore we show that our 
approach can be applied to the succinct representation of dynamic graphs and binary relations that 
supports basic adjacency and neighbor queries. Again our technique significantly reduces the gap 
between static and dynamic variants of this problem. 

These problems arise often in database applications. For example, reporting or counting occurrences
of a string in a dynamic collection of documents is an important operation in text databases
and web browsers. Similar tasks also arise in data analytics. Suppose that we keep a search log
and want to find out how many times URLs containing a certain substring were accessed. Finally,
the indexing problem is closely related to the problem of substring occurrence estimation [38]. The
latter problem is used in solutions of the substring selectivity estimation problem; we
refer to [38] for a more extensive description. Compressed storage schemes for such problems help
us save space and boost general performance because a larger portion of data can reside in the
fast memory. Graph representation of data is gaining importance in the database community. For
instance, the set of subject-predicate-object RDF triples can be represented as a graph or as two
binary relations. Our compressed representation applied to an RDF graph enables us to support
basic reporting and counting queries on triples. An example of such a query is, given x, to
enumerate all the triples in which x occurs as a subject. Another example is, given x and p, to
enumerate all triples in which x occurs as a subject and p occurs as a predicate.


Previous Results. Static Case We will denote by |T| the number of symbols in a sequence T
or in a collection of sequences; T[i] denotes the i-th element in a sequence T and T[i..j] = T[i]T[i +
1] ··· T[j]. Suffix trees and suffix arrays are two handbook data structures for the indexing problem.
A suffix array keeps (references to) all suffixes T[i..n] of a text T in lexicographic order. Using a suffix
array, we can find the range of suffixes starting with a query string P in t_range = O(|P| + log n)
time; once this range is found, we can locate each occurrence of P in T in t_locate = O(1) time. A
suffix tree is a compact trie that contains references to all suffixes T[i..n] of a text T. Using a suffix
tree, we can find the range of suffixes starting with a query string P in t_range = O(|P|) time; once
this range is found, we can locate every occurrence of P in T in t_locate = O(1) time. A large number
of compressed indexing data structures are described in the literature; we refer to [33] for a survey.
These data structures follow the same two-step procedure for answering a query: first, the range
of suffixes that start with P is found in O(t_range) time, then we locate each occurrence of P in T
in O(t_locate) time. Thus we report all occ occurrences of P in O(t_range + occ · t_locate) time. We can
also extract any substring T[i..i + ℓ] of T in O(t_extract) time. Data structures supporting queries
on a text T can be extended to answer queries on a collection C of texts: it suffices to append a
unique symbol $_i at the end of every text T_i from C and keep the concatenation of all T_i in the
data structure.
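The two-step procedure (range-finding, then locating) can be illustrated on an uncompressed suffix array. The following is a minimal sketch, not the compressed indexes discussed here: the function names are hypothetical, the suffix array is built naively by sorting suffixes, and range-finding uses plain binary search.

```python
def build_suffix_array(text):
    # Naive construction by sorting all suffixes (illustration only).
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_range(text, sa, pattern):
    # Range-finding: return the half-open range [start, end) of suffix array
    # positions whose suffixes start with `pattern`.
    m = len(pattern)

    def first(pred):
        # Smallest index whose m-symbol prefix satisfies the monotone predicate.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(text[sa[mid]:sa[mid] + m]):
                hi = mid
            else:
                lo = mid + 1
        return lo

    return first(lambda p: p >= pattern), first(lambda p: p > pattern)


def locate(sa, start, end):
    # Locating: map every suffix array position in the range to a text position.
    return sorted(sa[j] for j in range(start, end))


text = "banana$"
sa = build_suffix_array(text)
start, end = find_range(text, sa, "ana")
print(locate(sa, start, end))
```

In a compressed index the array `sa` is not stored explicitly, which is why t_locate depends on the sampling parameter s instead of being constant.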

We list the currently best and selected previous results for static text indexes with asymptotically
optimal space usage in Table 1. All listed data structures can achieve different space-time
trade-offs that depend on a parameter s: an index typically needs about nH_k + o(n log σ) +
O(n log n/s) bits and t_locate is proportional to s. Henceforth H_k denotes the k-th order empirical
entropy and σ denotes the alphabet size (see footnote 1). We assume that k ≤ α log_σ n - 1 for a constant 0 < α < 1.
H_k is the lower bound on the average space usage of any statistical compression method that encodes
each symbol using the context of k previous symbols [32]. The currently fastest such index
of Belazzougui and Navarro [7] reports all occurrences of P in O(|P| + s · occ) time and extracts a
substring of length ℓ in O(s + ℓ) time. Thus their query time depends only on the parameter s and
the length of P. Some recently described indexes achieve space usage nH_k + o(nH_k) + o(n)
or nH_k + o(nH_k) + O(n) instead of nH_k + o(n log σ).

If we are interested in obtaining faster data structures and can use O(n log σ) bits of space,
then better trade-offs between space usage and time are possible [22]. For the sake of space,
we describe only one such result. The data structure of Grossi and Vitter [22] uses O(n log σ) bits
and reports occurrences of a pattern in O(|P|/log_σ n + log^ε n + occ · log^ε n) time; see Table 3. We
remark that the fastest data structure in Table 1 needs Ω(n log^{1-ε} n) space to obtain the same time
for t_locate as in [22]. If a data structure from Table 1 uses O(n log σ) space, then t_locate = Ω(log_σ n).

Dynamic Document Collections In the dynamic indexing problem, we maintain a collection
of documents (strings) under insertions and deletions. An insertion adds a new document to the
collection; a deletion removes a document from the collection. For any query substring P, we
must return all occurrences of P in all documents. To be precise, we must report all pairs
(doc, off), such that P occurs in a document doc at position off. Because positions are reported
relative to document boundaries, an insertion or a deletion of a document does not change positions
of P in other documents. Indexes for dynamic collections of strings were also studied extensively.
The fastest previously known result for the case of large
alphabets is described in [35]. Their data structure, which builds on a long line of previous work,
uses nH_k + o(n log σ) + O(n log n/s) + O(p log n) bits of space, where p is the number of documents;
queries are answered in O(|P| log n/log log n + occ · s · log n/log log n) time and updates are supported
in O(log n + |T_u| log n/log log n) amortized time, where T_u is the document inserted into or deleted
from the collection. See Table 2 for some other previous results.
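To make the (doc, off) output format concrete, the following minimal sketch maps a position in the concatenation of all documents back to a document and a relative offset. The sorted list `starts` and both function names are assumptions made for this illustration; a dynamic index would keep the document boundaries in an updatable structure instead.

```python
from bisect import bisect_right


def document_starts(docs):
    # starts[d] = position in the concatenation where document d begins.
    # Each document is assumed to be followed by one separator symbol.
    starts, pos = [], 0
    for text in docs:
        starts.append(pos)
        pos += len(text) + 1
    return starts


def to_doc_offset(starts, pos):
    # Binary search for the last document that starts at or before `pos`.
    doc = bisect_right(starts, pos) - 1
    return doc, pos - starts[doc]


starts = document_starts(["abc", "de"])
print(to_doc_offset(starts, 5))
```

Since only the offset inside the document is reported, inserting or deleting another document never changes a pair that has already been reported.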

An important component of previous dynamic solutions is a data structure supporting rank and
select queries: a sequence S over an alphabet Σ = {1, ..., σ} is kept in a data structure so that
the i-th occurrence of a symbol a ∈ Σ and the number of times a symbol a occurs in S[1..i] for any
1 ≤ i ≤ n can be computed. Thus progress in dynamic indexing was closely related to progress
in dynamic data structures for rank and select queries. In [35] the authors obtain a dynamic data

Footnote 1. Let S be an arbitrary string over an alphabet Σ = {1, ..., σ}. A context s_i ∈ Σ^k is an arbitrary string of length
k. Let n_{s_i,a} be the number of times the symbol a is preceded by a context s_i in S and n_{s_i} = \sum_{a ∈ Σ} n_{s_i,a}. Then

H_k = -\sum_{s_i \in \Sigma^k} \sum_{a \in \Sigma} \frac{n_{s_i,a}}{n} \log \frac{n_{s_i,a}}{n_{s_i}} is the k-th order empirical entropy of S.
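The k-th order empirical entropy can be computed directly from context counts. The sketch below assumes the standard definition, H_k = -(1/n) · Σ n_{s,a} log(n_{s,a}/n_s) summed over contexts s and symbols a, and uses base-2 logarithms; the function name is hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def empirical_entropy(s, k):
    # counts[ctx][a] = number of times symbol a is preceded by context ctx.
    n = len(s)
    counts = defaultdict(Counter)
    for i in range(k, n):
        counts[s[i - k:i]][s[i]] += 1
    total = 0.0
    for followers in counts.values():
        n_ctx = sum(followers.values())
        for c in followers.values():
            total -= c * log2(c / n_ctx)
    return total / n


print(empirical_entropy("abab", 0), empirical_entropy("abab", 1))
```

For the string "abab" every symbol is determined by its predecessor, so the first-order entropy drops to zero although the zero-order entropy is one bit per symbol.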
Ref.  Space (+O(n log n/s))     t_range                     t_locate                  t_extract                       σ
      nH_k + o(n log σ)         O(|P| log σ + log^4 n)      O(s log σ)                O((s + ℓ) log σ)
      nH_k + o(n log σ)         O(|P| log n)                O(s)                      O(s + ℓ)
      nH_k + o(n log σ)         O(|P| log σ/log log n)      O(s log σ/log log n)      O((s + ℓ) log σ/log log n)
      nH_k + o(n log σ)         O(|P| log log σ)            O(s log log σ)            O((s + ℓ) log log σ)
      nH_k + o(nH_k) + o(n)     O(|P| log σ/log log n)      O(s log σ/log log n)      O((s + ℓ) log σ/log log n)
      nH_k + o(nH_k) + o(n)     O(|P| log log σ)            O(s log log σ)            O((s + ℓ) log log σ)
      nH_k + o(nH_k) + o(n)     O(|P|)                      O(s)                      O(s + ℓ)                        log^const n
      nH_k + o(nH_k) + O(n)     O(|P|)                      O(s)                      O(s + ℓ)

Table 1: Asymptotically optimal space data structures for static indexing. Occurrences of a string
P can be found in O(t_range + t_locate · occ) time. A substring T[i..i + ℓ] of T can be extracted in
O(t_extract) time. Results are valid for any k ≤ α log_σ n - 1 and 0 < α < 1.


Ref.  Space (+O(n log n/s) + p log n)  t_range                        t_locate                t_extract                 Update                                          σ
      O(n)                             O(|P| log n)                   O(log^2 n)              O((log n + ℓ) log n)      O(|T_u| log n)                                  const
      nH_k + o(n log σ)                O(|P| log n log σ)             O(s log n log σ)        O((s + ℓ) log n log σ)    O(|T_u| log n log σ)
      nH_k + o(n log σ)                O(|P| log n)                   O(s log n)              O((s + ℓ) log n)          O(|T_u| log n)
[35]  nH_k + o(n log σ)                O(|P| log n/log log n)         O(s log n/log log n)                              O(log n + |T_u| log n/log log n) A
Our   nH_k + o(n log σ)                O(|P| log log n)               O(s)                    O(s + ℓ)                  O(|T_u| log^{1+ε} n)                            log^const n
Our   nH_k + o(n log σ)                O(|P| log log n log log σ)     O(s log log σ)          O((s + ℓ) log log σ)      O(|T_u| log^ε n) / O(|T_u|(log^ε n + s))
Our   nH_k + o(n log σ)                O(|P| log log n)               O(s)                    O(s + ℓ)                  O(|T_u| log^ε n) R / O(|T_u|(log^ε n + s)) R


Table 2: Asymptotically optimal space data structures for dynamic indexing. The same notation
as in Table 1 is used. Randomized update procedures that achieve specified cost in expectation
are marked with R. Amortized update costs are marked with A. T_u denotes the document that
is inserted into (resp. deleted from) the data structure during an update operation. In previous
papers on dynamic indexing only the cases of s = log n or s = log_σ n log log n were considered, but
extension to an arbitrary value of s is straightforward.


structure that supports rank and select in O(log n/log log n) time. By the lower bound of Fredman
and Saks, this query time is optimal in the dynamic scenario. It was assumed that the solution
of the library management problem described in [35] achieves query time that is close to optimal.
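The semantics of the two queries can be pinned down with a naive linear-time reference implementation. This is only a sketch of what rank and select return, with hypothetical function names and 1-based positions as in the text; it says nothing about how the O(log n/log log n)-time dynamic structure works.

```python
def rank(S, a, i):
    # Number of times symbol a occurs in S[1..i] (1-based, inclusive).
    return S[:i].count(a)


def select(S, a, j):
    # 1-based position of the j-th occurrence of a in S, or None if absent.
    seen = 0
    for pos, symbol in enumerate(S, start=1):
        if symbol == a:
            seen += 1
            if seen == j:
                return pos
    return None


S = "abracadabra"
print(rank(S, "a", 4), select(S, "a", 3))
```

The two operations are inverse to each other in the sense that rank(S, a, select(S, a, j)) = j whenever the j-th occurrence exists.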

Our Results In this paper we show that the lower bound on the dynamic rank-select problem can
be circumvented and describe data structures that need significantly less time to answer queries.
Our results close or almost close the gap between static and dynamic indexing. If the alphabet
size σ = log^{O(1)} n, we can obtain an (nH_k + o(n log σ) + O(n log n/s))-bit data structure that answers
queries in O(|P| log log n + occ · s) time; updates are supported in O(|T_u| log^{1+ε} n) time, where T_u
denotes the document that is inserted into or deleted from the index. Our second data structure
supports updates in O(|T_u| log^ε n) expected time and answers queries in O(|P| log log n + occ · s)
time for an arbitrarily large alphabet. If the update procedure is deterministic, then queries are
answered in O((|P| log log n + occ · s) log log σ) time and updates are supported in O(|T_u| log^ε n)
worst-case time. See Table 2. If O(n log σ) bits of space are available, then our dynamic data
structure matches the currently fastest static result of Grossi and Vitter [22]. We can report all
occurrences of a pattern P in O(|P|/log_σ n + log^ε n + occ · log^ε n) time. This is the first compressed
dynamic data structure that achieves t_range = o(|P|) if σ = n^{o(1)}. Compared to the fastest previous
data structure that needs the same space, we achieve an O(log n log_σ n) factor improvement in query
time. A variant of this data structure with deterministic update procedure answers queries in
O(|P|(log log n)^2/log_σ n + log n + occ · log^ε n) time. See Table 3.

Footnote 2. Dynamic indexes also need O(p log n) bits to navigate between documents, where p is the number of documents.
Since p log n is usually negligible in comparison to n, we ignore this additive term, except for Tables 2 and 3, to
simplify the description.

Our data structures can also count occurrences of a pattern P in O(t_count) time. For previously
described indexes t_count = t_range. In our case, t_count = t_range + log n/log log n or t_count = (t_range +
log n/log log n) log log n. Times needed to answer a counting query are listed in Table 4. However, if
our data structures support counting queries, then update times grow slightly, as shown in Table 4.
All of the above mentioned results are obtained as corollaries of two general transformations.
Using these transformations, which work for a very broad class of indexes, we can immediately turn
almost any static data structure with good pre-processing time into an index for a dynamic collection
of texts. The query time either remains the same or increases by a very small multiplicative
factor. Our method can be applied to other problems where both compressed representation and
dynamism are desirable.

Binary Relations and Graphs One important area where our techniques can also be used
is compact representation of directed graphs and binary relations. Let R ⊆ L × O be a binary
relation between labels from a set L and objects from a set O. Barbay et al. [5] describe a compact
representation of a static binary relation R (i.e., the set of object-label pairs) that consists of a
sequence S_R and a bit sequence B_R. S_R contains the list of labels related to different objects and
is ordered by object. That is, S_R lists all labels related to an object o_1, then all labels related to
an object o_2, etc. The binary sequence B_R contains unary-encoded numbers of labels related to
objects o_1, o_2, .... Barbay et al. [5] showed how S_R and B_R can be used to support basic queries
on binary relations, such as listing or counting all labels related to an object, listing or counting
all objects related to a label, and telling whether a label and an object are related. Their method
reduces queries on a binary relation R to rank, select, and access queries on S_R and B_R. Another
data structure that stores a static binary relation and uses the same technique is described in [2].
Static compact data structures described in [5, 2] support queries in O(log log σ_l) time per reported
datum, where σ_l is the number of labels. For instance, we can report labels related to an object
(resp. objects related to a label) in O((k + 1) log log σ_l) time, where k is the number of reported
items; we can tell whether an object and a label are related in O(log log σ_l) time. In [35], the
authors describe a dynamization of this approach that relies on dynamic data structures answering
rank and select queries on dynamic strings S_R and B_R. Again the lower bound on dynamic rank
queries sets the limit on the efficiency of this approach. Since we need Ω(log n/log log n) time to
answer rank queries, the data structure of Navarro and Nekrich [35] needs O(log n/log log n) time
per reported item, where n is the number of object-label pairs in R. Updates are supported in
O(log n/log log n) amortized time and the space usage is nH + σ_l log σ_l + t log t + O(n + σ_l log σ_l),
where n is the number of pairs, H is the zero-order entropy of the string S_R, σ_l is the number
of labels and t is the number of objects. In [35] the authors also show that we can answer basic
adjacency and neighbor queries on a directed graph by regarding a graph as a binary relation
between nodes. Again reporting and counting out-going and in-going neighbors of a node can be
Ref.  t_range                                  t_locate            t_extract                 Update                σ
[22]  O(|P|/log_σ n + log^ε n)                 O(log^ε n)          O(ℓ/log_σ n)              static
      O(|P| log n)                             O(log^2 n)          O((log n + ℓ) log n)      O(|T_u| log n)        const
      O(|P| log n)                             O(log n log_σ n)    O((log_σ n + ℓ) log n)    O(|T_u| log n)
Our   O(|P|/log_σ n + log^ε n)                 O(log^ε n)          O(ℓ/log_σ n)              O(|T_u| log^ε n) R
Our   O(|P|(log log n)^2/log_σ n + log n)      O(log^ε n)          O(ℓ/log_σ n)              O(|T_u| log^ε n)


Table 3: O(n log σ)-bit indexes. Dynamic data structures need additional p log n bits. Randomized
update costs are marked with R.


performed in O(log n) time per delivered datum.
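The S_R / B_R encoding of a binary relation can be sketched as follows. The unary convention used here (one 1-bit per label, followed by a terminating 0-bit for every object), the function names, and the linear scans are assumptions made for illustration; the structures in [5] answer the same queries with rank and select instead of scanning.

```python
def encode_relation(pairs, num_objects):
    # pairs: iterable of (object, label); objects are numbered 0..num_objects-1.
    by_object = [[] for _ in range(num_objects)]
    for obj, label in sorted(pairs):
        by_object[obj].append(label)
    S = [label for labels in by_object for label in labels]
    B = []
    for labels in by_object:
        B.extend([1] * len(labels))  # unary count of labels of this object
        B.append(0)                  # terminator (assumed convention)
    return S, B


def labels_of(S, B, obj):
    # Labels related to `obj`: the slice of S between two consecutive 0-bits.
    ones = zeros = start = 0
    for bit in B:
        if bit == 1:
            ones += 1
        else:
            if zeros == obj:
                return S[start:ones]
            zeros += 1
            start = ones
    raise IndexError(obj)


def objects_of(S, B, label):
    # Objects related to `label`, found by walking S and B in parallel.
    result, obj, pos = [], 0, 0
    for bit in B:
        if bit == 0:
            obj += 1
        else:
            if S[pos] == label:
                result.append(obj)
            pos += 1
    return result


S, B = encode_relation({(0, "x"), (0, "y"), (2, "x")}, 3)
print(S, B, labels_of(S, B, 0), objects_of(S, B, "x"))
```

A directed graph fits the same scheme by treating the source node of an edge as the object and the target node as the label.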

In this paper we show how our method for dynamizing compressed data structures can be
applied to binary relations and graphs. Our data structure supports reporting labels related to an
object or reporting objects related to a label in O(log log n · log log σ_l) time per reported datum.
We support counting queries in O(log n) time and updates in O(log^ε n) worst-case time. The same
query times are also achieved for the dynamic graph representation. The space usage of our data
structures is dominated by nH where n is the number of pairs in a binary relation or the number
of edges in a graph and H is the zero-order entropy of the string S_R defined above. Thus the space
usage of our data structure matches that of [35] up to lower-order factors. At the same time we
show that reporting queries in a dynamic graph can be supported without dynamic rank and select
queries.


Space                 Counting                                 Updates               σ
nH_k + o(n log σ)     O(|P| log log n + log n)                 O(|T_u| log n)        log^const n
nH_k + o(n log σ)     O(|P| log log σ log log n + log n)       O(|T_u| log n)
nH_k + o(n log σ)     O(|P| log log n + log n)                 O(|T_u| log n) R
O(n log σ)            O(|P|/log_σ n + log n/log log n)         O(|T_u| log n) R
O(n log σ)            O(|P|(log log n)^2/log_σ n + log n)      O(|T_u| log n)


Table 4: Costs of counting queries for our data structures. Randomized update costs are marked
with R. The first three rows correspond to the last three rows in Table 2; the last two rows
correspond to the last two rows in Table 3.


Overview The main idea of our approach can be described as follows. The input data is distributed
among several data structures. We maintain a fraction of the data in an uncompressed
data structure that supports both insertions and deletions. We bound the number of elements
stored in uncompressed form so that the total space usage of the uncompressed data structure
is affordable. Remaining data is kept in several compressed data structures that do not support
updates. New elements (respectively new documents) are always inserted into the uncompressed
data structure. Deletions from the static data structures are implemented by the lazy deletions
mechanism: when a deletion takes place, the deleted element (respectively the document) is
marked as deleted. We keep positions of marked elements in a data structure, so that all elements
in a query range that are not marked as deleted can be reported in O(1) time per element. When
a static structure contains too much obsolete data (because a certain fraction of its size is marked
as deleted), this data structure is purged: we create a new instance of this data structure that
does not contain deleted elements. If the uncompressed data structure becomes too big, we move
its content into a (new) compressed data structure. Organization of compressed data structures is
inspired by the logarithmic method, introduced by Bentley and Saxe [8]: the size of compressed
data structures increases geometrically. We show that re-building procedures can be scheduled
in such a way that only a small fraction of data is kept in uncompressed form at any given time.
Since the bulk of data is kept in static data structures, our approach can be viewed as a general
framework that transforms static compressed data structures into dynamic ones.

In Section 2 we describe Transformation 1. Transformation 1, based on the approach outlined
above, can be used to turn a static indexing data structure into a data structure for a dynamic
collection of documents with amortized update cost. The query costs of the obtained dynamic
data structure are the same as in the underlying static data structure. In Section 3 we describe
Transformation 2 that turns a static indexing data structure into a dynamic data structure with
worst-case update costs. We use a more sophisticated division into sub-collections and slightly different
re-building procedures in Transformation 2. In Section 4 we describe how to obtain new
solutions of the dynamic indexing problem using our static-to-dynamic transformations. Finally,
Section 5 contains our data structures for dynamic graphs and binary relations.

2 Dynamic Document Collections 

In this section we show how a static compressed index I_s can be transformed into a dynamic
index I_d. C will denote a collection of texts T_1, ..., T_p. We say that an index I_s is (u(n), w(n))-
constructible if there is an algorithm that uses O(n · w(n)) additional workspace and constructs
I_s in O(n · u(n)) time. Henceforth we make the following important assumptions about the static
index I_s. I_s needs at most |S|φ(S) bits of space for any symbol sequence S and the function φ(·)
is monotonous: if a sequence S is a concatenation of S_1 and S_2, then |S|φ(S) ≥ |S_1|φ(S_1) +
|S_2|φ(S_2). We also assume that I_s reports occurrences of a substring in C using the two-step
method described in the introduction: first we identify the range [a, b] in the suffix array, such that
all suffixes that start with P are in [a, b]; then we find the positions of suffixes from [a, b] in the
document(s). These operations will be called range-finding and locating. Moreover the rank of any
suffix T_i[l..] in the suffix array can be found in time O(t_SA). The class of indexes that satisfy these
conditions includes all indexes that are based on compressed suffix arrays or the Burrows-Wheeler
transform. Thus the best currently known static indexes can be used in Transformation 1 and the
following transformations described in this paper.

Our result can be stated as follows. 

Transformation 1 Suppose that there exists a static (u(n), w(n))-constructible index I_s that uses
|S|φ(S) space for any document collection S. Then there exists a dynamic index I_d that uses
|S|φ(S) + O(|S|((log σ)/τ + w(n) + (log τ)/τ)) space for a parameter τ = O(log n/log log n); I_d supports
insertions and deletions of documents in O(u(n) log^ε n) time per symbol and O(u(n) · τ + t_SA + log^ε n)
time per symbol respectively. Update times are amortized. The asymptotic costs of range-finding,
extracting, and locating are the same in I_s and I_d.

We start by showing how to turn a static index into a semi-dynamic deletion-only index using
O((n/τ) log τ) additional bits. Then we will show how to turn a semi-dynamic index into a fully-
dynamic one.


Figure 1: Sub-collections C_i for dynamizing a deletion-only index. A data structure for C_0 is
fully-dynamic and stores documents in uncompressed form.


Supporting Document Deletions We keep a bit array B whose entries correspond to positions
in the suffix array SA of C. B[j] = 0 if SA[j] is a suffix of some text T_f, such that T_f was already
deleted from C, and B[j] = 1 otherwise. We keep a data structure V that supports the following
operations on B: zero(j) sets the j-th bit in B to 0; report(j_1, j_2) reports all 1-bits in B[j_1..j_2]. V
is implemented using Lemma 3, so that zero(j) is supported in O(log^ε n) time and report(j_1, j_2) is
answered in O(k) time, where k is the number of output bit positions. If B contains at most n/τ
zeros, then B and V need only O((n log τ)/τ) bits. Lemma 3 is proved in Section A.1.


When a document T_f is deleted, we identify the positions of T_f's suffixes in SA and set the
corresponding bits in B to 0. When the number of symbols in deleted documents equals n/τ, we
re-build the index for C without deleted documents in O(n · u(n)) time. The total amortized cost
of deleting a document is O(u(n)τ + t_SA + log^ε n) per symbol. To report occurrences of some string
P in C, we identify the range [s..e] such that all suffixes in SA[s..e] start with P in O(t_range) time.
Using V, we enumerate all j, such that s ≤ j ≤ e and B[j] = 1. For every such j, we compute
SA[j] in O(t_locate) time.
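The lazy-deletion mechanism can be sketched on top of a plain suffix array. This is a simplified illustration under several assumptions: the class and method names are hypothetical, a single separator symbol replaces the unique symbols $_i, deletion scans the whole suffix array instead of computing suffix ranks in O(t_SA) time, and the range is filtered by a linear scan of B instead of the structure of Lemma 3.

```python
class DeletionOnlyIndex:
    SEP = "\x00"  # assumed separator that never occurs in documents or patterns

    def __init__(self, docs, tau=4):
        self.tau = tau
        self._build(dict(docs))

    def _build(self, docs):
        self.docs = docs
        self.text = "".join(t + self.SEP for t in docs.values())
        # owner[p] = (doc_id, offset) of position p in the concatenation.
        self.owner = [(d, off) for d, t in docs.items() for off in range(len(t) + 1)]
        self.sa = sorted(range(len(self.text)), key=lambda i: self.text[i:])
        self.B = [1] * len(self.sa)  # B[j] = 0 marks a suffix of a deleted document
        self.deleted = set()
        self.deleted_symbols = 0

    def delete(self, doc_id):
        if doc_id not in self.docs or doc_id in self.deleted:
            raise KeyError(doc_id)
        self.deleted.add(doc_id)
        for j, p in enumerate(self.sa):
            if self.owner[p][0] == doc_id:
                self.B[j] = 0
        self.deleted_symbols += len(self.docs[doc_id]) + 1
        # Purge: re-build once a 1/tau fraction of the symbols is obsolete.
        if self.deleted_symbols * self.tau >= len(self.text):
            self._build({d: t for d, t in self.docs.items() if d not in self.deleted})

    def _range(self, pattern):
        m = len(pattern)

        def first(pred):
            lo, hi = 0, len(self.sa)
            while lo < hi:
                mid = (lo + hi) // 2
                if pred(self.text[self.sa[mid]:self.sa[mid] + m]):
                    hi = mid
                else:
                    lo = mid + 1
            return lo

        return first(lambda p: p >= pattern), first(lambda p: p > pattern)

    def find(self, pattern):
        s, e = self._range(pattern)
        return sorted(self.owner[self.sa[j]] for j in range(s, e) if self.B[j])


index = DeletionOnlyIndex({"a": "banana", "b": "bandana"}, tau=2)
print(index.find("ana"))
index.delete("a")
print(index.find("ana"))
```

With tau=2 the deletion above only flips bits in B; with a larger tau the same call would cross the threshold and trigger the re-build.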


Fully-Dynamic Index We split C into a constant number of sub-collections C_0, C_1, ..., C_r such
that |C_i| ≤ max_i for all i. The maximum size of the i-th sub-collection, max_i, increases geometrically:
max_0 = 2n/log^2 n and max_i = 2(n/log^2 n) log^{ε·i} n for a constant ε > 0; see Fig. 1. There is
no lower bound on the number of symbols in a sub-collection C_i; for instance, any C_i can be empty.
Our main idea is to store C_0 in uncompressed form and C_i for i ≥ 1 in semi-static deletion-only data
structures. Insertions into C_i for i ≥ 1 are supported by re-building the semi-static index of C_i. We
also re-build all sub-collections when the total number of elements is increased by a constant factor
(global re-build).

We store the document collection C_0 in uncompressed form. Suffixes of all documents in C_0 are
kept in an (uncompressed) suffix tree V_0. We can insert a new text T into V_0 or delete T from V_0
in O(|T|) time. Using V_0, all occurrences of a pattern P in C_0 can be reported in O(|P| + occ) time.
Since |C_0| ≤ 2n/log^2 n, we need O(n/log n) bits to store C_0. For completeness we will describe the
data structure for C_0 in Section A.2.

Every C_i for i ≥ 1 is kept in a semi-dynamic data structure described in the first part of this
section. Let size(i) denote the total length of all undeleted texts in C_i. When a new document
T must be inserted, we find the first (smallest) collection C_j such that \sum_{i=0}^{j} size(i) + |T| ≤ max_j
where max_j = 2(n/log^2 n) log^{ε·j} n. That is, we find the first sub-collection C_j that can accommodate
the new text T and all preceding sub-collections without exceeding the size limit. If j = 0, we insert
the new text into C_0. Otherwise, if j ≥ 1, we discard the old indexes for all C_i where 0 ≤ i ≤ j, set
C_j = (∪_{i=0}^{j} C_i) ∪ T and construct a new semi-static index for C_j. If \sum_{i=0}^{j} size(i) + |T| > max_j for
all j, we start a global re-build procedure: all undeleted texts from old sub-collections are moved
to the new sub-collection C_r and parameters max_i are re-calculated; new sub-collections C_i for
0 ≤ i < r are initially empty after the global re-build.
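The insertion rule can be simulated by tracking only the sizes of the sub-collections. In the sketch below the function names are hypothetical, logarithms are assumed to be base 2, r is taken to be ⌈2/ε⌉ as in the analysis of this section, and a return value of None stands for the case that requires a global re-build.

```python
import math


def capacities(n, eps):
    # max_i = 2 (n / log^2 n) * log^(eps * i) n for i = 0..r with r = ceil(2 / eps).
    r = math.ceil(2 / eps)
    base = 2 * n / math.log2(n) ** 2
    return [base * math.log2(n) ** (eps * i) for i in range(r + 1)]


def insert(sizes, caps, text_len):
    # sizes[i] = size(i). Returns the index j of the sub-collection that
    # receives the new text together with C_0..C_{j-1}, or None if no
    # sub-collection can accommodate them.
    prefix = 0
    for j, cap in enumerate(caps):
        prefix += sizes[j]
        if prefix + text_len <= cap:
            for i in range(j):
                sizes[i] = 0          # C_0..C_{j-1} are merged into C_j
            sizes[j] = prefix + text_len
            return j
    return None


caps = [4, 16, 64]
sizes = [0, 0, 0]
for length in (3, 3, 4, 60):
    print(insert(sizes, caps, length), sizes)
```

The example uses small hand-picked capacities so that the merge into C_1 and the final overflow are easy to follow; `capacities` produces the geometric sequence from the text.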

We start a global re-build procedure when the total number of elements is at least doubled.
Hence, the amortized cost of a global re-build is O(u(n)). The amortized cost of re-building sub-collections
can be analyzed as follows. When a sub-collection C_j is re-built, we insert all symbols
from sub-collections C_i, 0 ≤ i < j, and the new text T into C_j. Our insertion procedure guarantees
that \sum_{i=0}^{j-1} size(i) + |T| > max_{j-1}. We need O(max_j · u(n)) time to construct a new index for C_j.
The cost of re-building C_j can be distributed among the new text symbols inserted into C_j. Since
max_{j-1} = max_j/log^ε n, the amortized cost of inserting a new symbol into C_j is O(u(n) · log^ε n).
Every text is moved at most once to any sub-collection C_j for any j such that 1 ≤ j ≤ ⌈2/ε⌉. Hence
the total amortized cost of an insertion is O((1/ε)u(n) · log^ε n) per symbol.
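The amortized bound can be written out step by step. The lines below only restate the argument of this paragraph; m_j is a name introduced here for the number of symbols moved into C_j by one re-build.

```latex
\begin{align*}
m_j &= \sum_{i=0}^{j-1} \mathrm{size}(i) + |T|
      \;>\; \mathrm{max}_{j-1} = \frac{\mathrm{max}_j}{\log^{\varepsilon} n},\\[2pt]
\frac{O(\mathrm{max}_j \cdot u(n))}{m_j}
    &= O\bigl(u(n)\log^{\varepsilon} n\bigr)
      \quad\text{per symbol moved into } C_j,\\[2pt]
\sum_{j=1}^{\lceil 2/\varepsilon\rceil} O\bigl(u(n)\log^{\varepsilon} n\bigr)
    &= O\bigl((1/\varepsilon)\,u(n)\log^{\varepsilon} n\bigr)
      \quad\text{per inserted symbol}.
\end{align*}
```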

A query is answered by querying all non-empty sub-collections C_i for i = 0, 1, ..., r. Since
r = O(1), query times are the same as in the underlying static index. Splitting a collection into
sub-collections does not increase the space usage because the function φ(·) is monotonous. We need
O((n/τ) log τ) bits to keep data structures V for all C_i. Another O(n w(n)) bits are needed for
global and local re-builds. Finally we need O((n/τ) log σ) bits to store the symbols from deleted
documents. Since there are no more than O(n/τ) deleted symbols, we use O((n/τ) log σ) + o(n log σ)
additional bits to store them; a more detailed analysis is given in Section A.5. Hence, the total
space overhead of our dynamic index is O(n(w(n) + (log σ + log τ)/τ)).

A data structure with faster insertions and slightly higher query time can be obtained by
increasing the number of sub-collections C_i to O(log log n). We describe this variant of our method
in Appendix A.4.


3 Worst-Case Updates 

In this section we will prove the following result. 

Transformation 2 Suppose that there exists a static $(u(n), w(n))$-constructible index $\mathcal{I}_s$ that uses
$|S|\phi(S)$ space for any document collection $S$. Then there exists a dynamic index $\mathcal{I}_d$ that uses
$|S|\phi(S) + O\left(|S|\,\frac{\log\sigma + \log\tau + w(n)}{\tau}\right)$ space for any parameter $\tau = O(\log n/\log\log n)$; $\mathcal{I}_d$ supports
insertions and deletions of documents in $O(u(n)\log^{\varepsilon} n)$ time per symbol and $O(u(n) \cdot (\log^{\varepsilon} n + \tau\log\tau) + t_{SA})$
time per symbol respectively. The asymptotic cost of range-finding increases by $O(\tau)$; the
costs of extracting and locating are the same in $\mathcal{I}_s$ and $\mathcal{I}_d$.

We use the index of Transformation 1 as the starting point. First we give an overview of our
data structure and show how queries can be answered. Then we describe the procedures for text
insertions and deletions.


Overview The main idea of supporting updates in the worst case is to maintain several copies of the
same sub-collection. An old copy of $C_j$ is locked while a new updated version of $C_{j+1}$ that includes
$C_j$ is created in the background. When a new version of $C_{j+1}$ is finished, we discard the old locked
sub-collection. When a new document $T$ must be inserted, we insert it into $C_0$ if $|C_0| + |T| \le \mathrm{max}_0$.


Otherwise we look for the smallest $j \ge 0$ such that $C_{j+1}$ can accommodate both $C_j$ and $T$; then we
move both $T$ and all documents from $C_j$ into $C_{j+1}$.³ If the new document $T$ is large, $|T| \ge \mathrm{max}_j/2$,
we can afford to re-build $C_{j+1}$ immediately after the insertion of $T$. If the size of $T$ is smaller than
$\mathrm{max}_j/2$, re-building of $C_{j+1}$ is postponed. For every following update, we spend $O(\log^{\varepsilon} n \cdot u(n))$
time per symbol on creating the new version of $C_{j+1}$. The old versions of $C_j$, $C_{j+1}$ are retained until
the new version is completed. If the number of symbols that are marked as deleted in $C_j$ exceeds
$\mathrm{max}_j/2$, we employ the same procedure for moving $C_j$ to $C_{j+1}$: $C_j$ is locked and we start the process
of constructing a new version of $C_{j+1}$ that contains all undeleted documents from $C_j$.

The disadvantage of delayed re-building is that we must keep two copies of every document
in $C_j \cup C_{j+1}$ until the new $C_{j+1}$ is completed. In order to reduce the space usage, we keep only a
fraction of all documents in sub-collections $C_i$. All $C_i$ for $0 \le i \le r$ will contain $O(n/\tau)$ symbols,
where $\tau$ is the parameter determining the trade-off between space overhead and query time. The
remaining documents are kept in top sub-collections $T_1, \ldots, T_g$ where $g \le 2\tau$. Top sub-collections
are constructed using the same delayed approach. But once $T_i$ is finished, no new documents are
inserted into $T_i$. We may have to re-build a top collection or merge it with another $T_j$ when the
fraction of deleted symbols exceeds a threshold value $1/\tau$. We employ the same
rebuilding-in-the-background approach. However, we will show that the background procedures for maintaining $T_i$
can be scheduled in such a way that only one $T_j$ is re-built at any given moment. Hence, the total
space overhead due to re-building and storage of deleted elements is bounded by an additive term
$O(n(\log\sigma + w(n))/\tau)$.

Data Structures We split a document collection $C$ into sub-collections $C_0, C_1, \ldots, C_r$, locked copies $L_1, \ldots, L_r$,
and top sub-collections $T_1, \ldots, T_g$ where $g = O(\tau)$. We will also use auxiliary collections $N_1, \ldots,
N_{r+1}$ and temporary collections $Temp_1, \ldots, Temp_r$. The $Temp_i$ are also used to answer queries but
each non-empty $Temp_i$ contains exactly one document; the $Temp_i$ are used as temporary storage for
new documents that are not yet inserted into "big" collections. The sizes of sub-collections can be
defined as a function of a parameter $n_f$ such that $n_f = \Theta(n)$; the value of $n_f$ changes when $n$ becomes
too large or too small. Let $\mathrm{max}_i = 2(n_f/\log^2 n)\log^{\varepsilon i} n$. We maintain the invariant $|C_i| \le \mathrm{max}_i$ for
all $i$, $0 \le i \le r$, but $r$ is chosen in such a way that $n_f/\log^{2 - r\varepsilon} n_f = n_f/\tau$. Every $T_i$ contains $\Omega(n_f/\tau)$
symbols. If $T_i$ contains more than one text, then its size is at most $4n_f/\tau$; otherwise $T_i$ can be
arbitrarily large. When a collection $C_j$ is merged with $C_{j+1}$, the process of re-building $C_j$ can be
distributed among a number of future updates (insertions and deletions of documents). During
this time $C_j$ is locked: we set $L_j = C_j$ and initialize a new empty sub-collection $C_j$. When a new
sub-collection $N_{j+1} = C_{j+1} \cup L_j$ is completed, we set $C_{j+1} = N_{j+1}$ and discard the old $C_{j+1}$ and $L_j$. A
query is answered by querying all non-empty $C_i$, $L_i$, $Temp_i$, and $T_i$. Therefore the cost of answering
a range-finding query grows by $O(\tau)$. The costs of locating and extracting are the same as in the
static index. We show the main sub-collections used by our method in Figure 2.

Insertions When a document $T$ is inserted, we consider all $j$ such that $0 \le j \le r$ and the
locked copy $L_j$ is not empty. For every such $j$, we spend $O(|T|\log^{\varepsilon} n \cdot u(n))$ units of time
on constructing the auxiliary collection $N_{j+1}$. If $N_{j+1}$ for some $0 \le j \le r - 1$ is completed, we set $C_{j+1} = N_{j+1}$ and
$N_{j+1} = Temp_{j+1} = \emptyset$; if $N_{r+1}$ is completed, we set $T_{g+1} = N_{r+1}$, increment the number of
top collections $g$, and set $N_{r+1} = Temp_{r+1} = \emptyset$. Then we look for a sub-collection that can
accommodate the new document $T$. If $|T| \ge n/\tau$, we create the index for a new sub-collection
$T_i$ that contains a single document $T$. If $|T| < n/\tau$, we look for the smallest index $j$ such that
$|C_{j+1}| + |C_j| + |T| \le \mathrm{max}_{j+1}$. That is, $C_{j+1}$ can accommodate both the preceding sub-collection
$C_j$ and $T$. If $|T| \ge \mathrm{max}_j/2$, we set $C_{j+1} = C_j \cup C_{j+1} \cup T$ and create an index for the new $C_{j+1}$ in
$O(|C_{j+1}| \cdot u(n)) = O(|T|\log^{\varepsilon} n \cdot u(n))$ time. If $|T| < \mathrm{max}_j/2$, the collection $C_j$ is locked. We set
$L_j = C_j$, $C_j = \emptyset$ and initiate the process of creating $N_{j+1} = L_j \cup C_{j+1} \cup T$. The cost of creating the
new index for $N_{j+1}$ will be distributed among the next $\mathrm{max}_j$ update operations. We also create a
temporary static index $Temp_{j+1}$ for the text $T$ in $O(|T|\,u(n))$ time. This procedure is illustrated
in Fig. 3. If the index $j$ is not found and $|C_i| + |C_{i+1}| + |T| > \mathrm{max}_{i+1}$ for all $i$, $0 \le i < r$, we lock
$C_r$ (that is, set $L_r = C_r$ and $C_r = \emptyset$) and initiate the process of constructing $N_{r+1} = L_r \cup T$. We
also create a temporary index $Temp_{r+1}$ for the document $T$ in $O(|T|\,u(n))$ time.

3 Please note the difference between Transformations 1 and 2. In Transformation 1 we look for the sub-collection
$C_j$ that can accommodate the new document and all smaller sub-collections $C_0, \ldots, C_{j-1}$. In Transformation 2 we
look for the sub-collection $C_{j+1}$ that can accommodate the new document and the preceding
sub-collection $C_j$. We made this change in order to avoid some technical complications caused by delayed re-building.

Figure 2: Dynamization with worst-case update guarantees. Only main sub-collections used for
answering queries are shown. $C'_r$ and auxiliary collections $N_i$ are not shown.
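
The case analysis of the insertion procedure can be summarized by a small decision function. The following Python sketch is only an illustration under stated assumptions: the function name `plan_insertion`, the action labels it returns, and the argument `top_threshold` (standing for the size bound above which a document gets its own top collection) are hypothetical names introduced here, and the sketch decides *where* a document goes without building any index.

```python
def plan_insertion(sizes, caps, doc_len, top_threshold):
    """Decide where a new document of length doc_len should go.

    sizes[i] is the current size |C_i| and caps[i] is max_i, for i = 0..r.
    Returns a pair (action, index); the action labels are illustrative only.
    """
    r = len(sizes) - 1
    # A very large document becomes a top collection of its own.
    if doc_len >= top_threshold:
        return ("new_top_collection", None)
    # Small documents go into C_0 while it has room (see the Overview).
    if sizes[0] + doc_len <= caps[0]:
        return ("insert_into_C0", 0)
    # Otherwise find the smallest j such that C_{j+1} can hold C_j and T.
    for j in range(r):
        if sizes[j + 1] + sizes[j] + doc_len <= caps[j + 1]:
            if 2 * doc_len >= caps[j]:
                # |T| >= max_j / 2: the re-build is paid for by T itself.
                return ("rebuild_immediately", j + 1)
            # |T| < max_j / 2: lock C_j and re-build in the background.
            return ("lock_and_rebuild_in_background", j + 1)
    # No level fits: lock C_r and start building a new top collection.
    return ("lock_C_r_and_build_top", r + 1)
```

For example, with `sizes = [7, 10, 50]` and `caps = [8, 40, 200]`, a document of length 3 does not fit into $C_0$, fits into $C_1$ together with $C_0$, and is smaller than $\mathrm{max}_0/2$, so the sketch reports a background re-build of level 1.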

Deletions Indexes for sub-collections $C_i$, $1 \le i \le r$, and $T_j$, $1 \le j \le g$, support lazy deletions in
the same way as in Section 2: when a document is deleted from a sub-collection, we simply mark
the positions of suffixes in the suffix array as deleted and set the corresponding bits in the bit vector
$B$ to 0. Augmenting an index so that lazy deletions are supported is done in exactly the same way
as in Section 2.

We will need one additional sub-collection $C'_r$ to support deletions. If a sub-collection $C_j$ for
$1 \le j \le r - 1$ contains $\mathrm{max}_j/2$ deleted elements, we start the process of re-building $C_j$ and merging
it with $C_{j+1}$. This procedure is the same as in the case of insertions. We lock $C_j$ by setting $L_j = C_j$
and $C_j = \emptyset$. The data structure $N_{j+1} = C_{j+1} \cup L_j$ will be re-built during the following $\mathrm{max}_j/2$
updates. If a sub-collection $C_r$ contains $\mathrm{max}_r/2$ deleted symbols, we set $C'_r = C_r$ and $C_r = \emptyset$. The
sub-collection $C'_r$ will be merged with the next sub-collection $T_i$ to be re-built.

If a collection $T_i$ contains a single document and this document is deleted, then $T_i$ is discarded.
We also bound the number of deleted symbols in any $T_i$ by $n_f/\tau$. This is achieved by running the
following background process. After each series of $n_f/(2\tau\log\tau)$ symbol deletions, we identify the $T_j$
that contains the largest number of deleted symbols. During the next $n_f/(2\tau\log\tau)$ symbol deletions
we build the new index for $T_j$ without the deleted symbols. At the same time we remove the deleted
symbols from $C'_r$ if $C'_r$ exists. If $C'_r$ exists and contains at least $n_f/(2\tau)$ undeleted symbols, we create
an index for a new sub-collection $T_{g+1}$ and increment the number $g$ of top collections. If $C'_r$ exists,
but contains fewer than $n_f/(2\tau)$ undeleted symbols, we merge $C'_r$ with the largest $T_j$ that contains more
than one document and split the result if necessary: if the number of undeleted symbols in $C'_r \cup T_j$
does not exceed $2n_f/\tau$, we construct an index for $T_j \cup C'_r$ without deleted symbols; otherwise, we
split $T_j \cup C'_r$ into two parts $T'_j$, $T''_j$ and create indexes for the new sub-collections. Our method
guarantees that the number of deleted elements in any collection $T_i$ does not exceed $O(n_f/\tau)$, as
follows from a theorem of Dietz and Sleator [12].

Figure 3: Suppose that $C_{j+1}$ is the first sub-collection that can accommodate both $C_j$ and a new
document $T_n$. If $C_j$ must be rebuilt in the background, we "rename" $C_j$ to $L_j$ and initialize another
(initially empty) $C_j$. The new document $T_n$ is put into a separate collection $Temp_{j+1}$ (a). A background
process creates a new collection $N_{j+1}$ that contains all documents from $L_j$, $C_{j+1}$ and $Temp_{j+1}$ (b).
When $N_{j+1}$ is finished, we discard $C_{j+1}$, $L_j$ and $Temp_{j+1}$, and set $C_{j+1} = N_{j+1}$ (c). Our procedure
guarantees that $N_{j+1}$ is completed before the new sub-collection $C_j$ must be re-built again.

Lemma 1 ([12], Theorem 5) Suppose that $x_1, \ldots, x_g$ are variables that are initially zero.
Suppose that the following two steps are iterated: (i) we add a non-negative real value $a_i$ to each $x_i$
such that $\sum a_i = 1$; (ii) we set the largest $x_i$ to 0. Then at any time $x_i \le 1 + h_{g-1}$ for all $i$, $1 \le i \le g$,
where $h_i$ denotes the $i$-th harmonic number.

Let $m_i$ be the number of deleted elements in the $i$-th top collection $T_i$ and $\delta = n_f/(2\tau\log\tau)$. We
define $x_i = m_i/\delta$. We consider the working of our algorithm during the period when the value of $n_f$
is fixed. Hence, $\delta$ is also fixed and the number of variables $x_i$ is $O(\tau)$ (some $x_i$ can correspond to
empty collections). Every iteration of the background process sets the largest $x_i$ to 0. During each
iteration $\sum x_i$ increases by 1. Hence, the values of $x_i$ can be bounded from above by the result of
Lemma 1: $x_i \le 1 + h_{2\tau}$ for all $i$ at all times. Hence $m_i = O((n_f/(2\tau\log\tau))\log\tau) = O(n_f/\tau)$ for all
$i$ because $h_i = O(\log i)$. Thus the fraction of deleted symbols in each $T_i$ is $O(1/\tau)$.
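
The bound of Lemma 1 can be sanity-checked numerically. The Python sketch below simulates the two-step process of the lemma; the function names `harmonic` and `zeroing_game` and the use of random increments are assumptions made for this illustration (the lemma itself holds for any choice of the $a_i$), so the simulation only shows the bound on one particular run and proves nothing.

```python
import random


def harmonic(k):
    """Return the k-th harmonic number h_k = 1 + 1/2 + ... + 1/k."""
    return sum(1.0 / i for i in range(1, k + 1))


def zeroing_game(g, rounds, seed=0):
    """Simulate the process of Lemma 1 with g variables.

    Each round adds non-negative values a_i with sum 1 (chosen at random
    here, which is an assumption of this sketch) and then resets the
    largest variable to 0. Returns the largest value ever observed.
    """
    rng = random.Random(seed)
    x = [0.0] * g
    peak = 0.0
    for _ in range(rounds):
        weights = [rng.random() for _ in range(g)]
        total = sum(weights)
        for i in range(g):
            x[i] += weights[i] / total      # step (i): increments sum to 1
        peak = max(peak, max(x))
        x[x.index(max(x))] = 0.0            # step (ii): zero the largest
    return peak
```

In the setting of the text, each $x_i$ plays the role of $m_i/\delta$ for one top collection, and one round corresponds to one period of $\delta$ symbol deletions followed by the re-build of the collection with the most deleted symbols.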

It is easy to show that the sub-collections that we use are sufficient for our algorithm. When
a locked copy $L_j$ is initialized, $C_j$ is empty. The situation when $C_j$ cannot accommodate a new
document $T_n$ and a preceding sub-collection $C_{j-1}$ can happen only after $\mathrm{max}_j - \sum_{t=1}^{j-1} \mathrm{max}_t$ new
symbol insertions. Since we spend $O(\log^{\varepsilon} n \cdot u(n))$ time for constructing $N_{j+1}$ with each new
symbol insertion, we can choose constants in such a way that construction of $N_{j+1}$ is finished (and
$L_j$ is discarded) after $\mathrm{max}_j/2 < \mathrm{max}_j - \sum_{t=1}^{j-1} \mathrm{max}_t$ symbol insertions. The situation when $C_j$
contains $\mathrm{max}_j/2$ deleted symbols can happen after at least $\mathrm{max}_j$ new symbol updates ($\mathrm{max}_j/2$
insertions and $\mathrm{max}_j/2$ deletions). Hence, the collection $L_j$ is discarded before $C_j$ has to be locked
again. In our description of update procedures we assumed that the parameter $n_f$ is fixed. We
can maintain the invariant $n_f = \Theta(n)$ using standard methods; for completeness we provide a
description in Section A.3.


The space overhead caused by storing copies of deleted elements is bounded by $O(n/\tau)$: all $C_i$
contain $O(n/\tau)$ symbols and at most every second symbol in each $C_i$ is from a deleted document;
the fraction of deleted symbols in each $T_i$ does not exceed $O(1/\tau)$. By the same argument, at any
moment of time at most $O(n/\tau)$ symbols are in sub-collections that are re-built. Hence re-building
procedures running in the background need $O(n\,w(n)/\tau)$ bits of space. Since each $T_i$ contains at
most $O(|T_i|/\tau)$ deleted symbols, we can store the data structure $V$, which enables us to identify
undeleted elements in any range of the suffix array and is implemented as described in Lemma 3,
using $O(|T_i|\log\tau/\tau)$ bits. Data structures $V$ for all $T_i$ need $O(n\log\tau/\tau)$ bits. Hence, the total
space overhead of $\mathcal{I}_d$ compared to $\mathcal{I}_s$ is $O\left(n\,\frac{w(n) + \log\tau + \log\sigma}{\tau}\right)$.


Counting Occurrences Our dynamic indexes can be easily extended so that pattern counting 
queries are supported. 

Theorem 1 We can augment the indexes $\mathcal{I}_d$ of our Transformations with $O((n\log\tau)/\tau)$ additional
bits so that all occurrences of a pattern can be counted in $O(t_{count})$ time, where $t_{count} = (t_{range} +
\log n/\log\log n)(r + \tau)$ and $r$ and $\tau$ are defined as in the proofs of the respective Transformations. If counting
is supported, update times are increased by an additive $O(\log n/\log\log n)$ term per symbol.

Proof: Every semi-dynamic index for a sub-collection $C_i$ (respectively $T_i$) already keeps a vector
$B$ that enables us to identify the suffixes of already deleted documents in the suffix array. We
also store each $B$ in a data structure of Navarro and Sadakane [32] that supports rank queries in
$O(\log n/\log\log n)$ time and updates in $O(\log n/\log\log n)$ time. If $B$ contains $O(|B|/\tau)$ zero values,
then the structure of [32] needs $O((|B|/\tau)\log\tau)$ bits. Using this data structure, we can count the
number of 1's in any portion $B[a..b]$ of $B$ in the same time. To answer a counting query, we first
answer a range-finding query in every sub-collection. For every non-empty range that we found, we
count the number of 1's in that range. Finally, we sum the answers for all sub-collections. Since a
range-finding query returns the range of all suffixes that start with a query pattern and each 1 in
$B$ corresponds to a suffix of an undeleted document, our procedure is correct. □
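
The counting procedure of the proof can be illustrated with a short Python sketch. It swaps in a plain Fenwick (binary indexed) tree for the compressed rank structure of [32]; that substitution, the class name `FenwickBits`, and the helper `count_occurrences` are assumptions of this example, and the sketch does not reproduce the space or time bounds of the theorem.

```python
class FenwickBits:
    """Bit vector B, initially all 1s, with zeroing and range counting.

    A plain Fenwick tree is used here as a simple stand-in for the
    dynamic rank structure cited in the proof.
    """

    def __init__(self, n):
        self.n = n
        self.bits = [1] * n
        self.tree = [0] * (n + 1)
        for i in range(1, n + 1):           # linear-time construction
            self.tree[i] += 1
            parent = i + (i & -i)
            if parent <= n:
                self.tree[parent] += self.tree[i]

    def zero(self, i):
        """Mark suffix-array position i (0-based) as deleted."""
        if self.bits[i] == 1:
            self.bits[i] = 0
            j = i + 1
            while j <= self.n:
                self.tree[j] -= 1
                j += j & -j

    def prefix(self, i):
        """Number of 1-bits in B[0..i-1]."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def count(self, a, b):
        """Number of 1-bits in B[a..b], both ends inclusive."""
        return self.prefix(b + 1) - self.prefix(a)


def count_occurrences(results):
    """Sum the undeleted suffixes over all sub-collections.

    results is a list of (bit_vector, range) pairs, where range is the
    (a, b) answer of a range-finding query or None if it is empty.
    """
    return sum(vec.count(*rng) for vec, rng in results if rng is not None)
```

Each pair passed to `count_occurrences` corresponds to one sub-collection: the range comes from the static index, and the bit vector filters out suffixes of deleted documents.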


4 Dynamic Indexes 

To obtain our results on dynamic document collections, we only need to plug some currently
known static indexes into Transformations 1 and 2. For completeness, we prove the statements
about constructibility of static indexes in Section A.6.

The static index of Belazzougui and Navarro [7] is $(\log^{\varepsilon} n, \log\sigma)$-constructible. Their index
achieves $t_{range} = O(|P|)$, $t_{extract} = O(s + \ell)$, and $t_{SA} = t_{locate} = O(s)$ for arbitrarily large alphabets;
it needs nH k + 0(n ) + O (n log ^ g - ) -f-O(n) bits. We apply Transformation j2j with r = loglogn. 

The construction algorithm for this index relies on randomized algorithm for constructing an mmphf 
functions 0 ; therefore the update procedures of our dynamic data structure also rely on randomiza¬ 
tion in this case. The resulting dynamic index uses nH k + O(n^p) + O( w i 0 ^g n ) + 0(n) bits. This 
index achieves $t_{range} = O(|P|\log\log n)$, $t_{extract} = O(s + \ell)$, $t_{locate} = O(s)$. Insertions and deletions
are supported in $O(|T|\log^{\varepsilon} n)$ time and $O(|T|(\log^{\varepsilon} n + s))$ expected time respectively. If counting
queries are also supported, then $t_{count} = O(|P|\log\log n + \log n)$ and updates take $O(|T|\log n)$
expected time.

The index of Barbay et al. [3, 2] is also $(\log^{\varepsilon} n, \log\sigma)$-constructible and uses $nH_k + O(n\frac{\log n}{s}) +
o(n\log\sigma)$ bits. If the alphabet size $\sigma = \log^{O(1)} n$, this index achieves $t_{range} = O(|P|)$, $t_{extract} =
O(s + \ell)$, and $t_{locate} = O(s)$. If we set $\tau = \log\log n$
and apply Transformation 2, we obtain a dynamic data structure with $t_{range} = O(|P|\log\log n)$,
$t_{extract} = O(s + \ell)$, and $t_{SA} = t_{locate} = O(s)$. For an arbitrary alphabet size $\sigma$, the index of
Barbay et al. [3, 2] achieves $t_{range} = O(|P|\log\log\sigma)$, $t_{extract} = O((s + \ell)\log\log\sigma)$, and $t_{SA} =
t_{locate} = O(s\log\log\sigma)$. Again we set $\tau = \log\log n$ and apply Transformation 2. We obtain a
dynamic index that has query costs $t_{range} = O(|P|\log\log\sigma\log\log n)$, $t_{extract} = O((s + \ell)\log\log\sigma)$,
and $t_{SA} = t_{locate} = O(s\log\log\sigma)$. Insertions and deletions are supported in $O(|T|\log^{\varepsilon} n)$ time
and $O(|T|(\log^{\varepsilon} n + s))$ time respectively. If counting queries are also supported, then $t_{count} =
O(|P|\log\log n\log\log\sigma + \log n)$ (resp. $t_{count} = O(|P|\log\log n + \log n)$ if $\sigma = \log^{O(1)} n$) and updates
take $O(|T|\log n)$ time.

The index of Grossi and Vitter [22] is $(\log^{\varepsilon} n, \log\sigma)$-constructible. It achieves $t_{locate} = O(\log^{\varepsilon} n)$,
$t_{range} = O(|P|/\log_{\sigma} n + \log^{\varepsilon} n)$ and $t_{extract} = O(\ell/\log_{\sigma} n)$. We apply Transformation 2 with $\tau = 1/\delta$
for a constant $\delta$. The resulting dynamic index uses $O(n\log\sigma(1 + 1/\delta)) = O(n\log\sigma)$ bits and has the
following query costs: $t_{locate} = O(\log^{\varepsilon} n)$, $t_{range} = O(|P|/\log_{\sigma} n + \log^{\varepsilon} n)$, $t_{extract} = O(\ell/\log_{\sigma} n)$.
As described in Section A.2, in this case the data structure for the uncompressed sequence $C_0$ relies on
hashing. Therefore the update procedure is randomized. Updates are supported in $O(|T|\log^{2\varepsilon} n)$
expected time, but we can replace $\varepsilon$ with $\varepsilon/2$ in our construction and reduce the update time to
$O(|T|\log^{\varepsilon} n)$. If counting queries are also supported, then $t_{count} = O(|P|/\log_{\sigma} n + \log n/\log\log n)$
and updates take $O(|T|\log n)$ expected time. If we want to support updates using a deterministic
procedure, then the cost of searching in $C_0$ grows to $O(|P|(\log\log n)^2/\log_{\sigma} n + \log n)$. In this case
$t_{range} = t_{count} = O(|P|(\log\log n)^2/\log_{\sigma} n + \log n)$, $t_{locate} = O(\log^{\varepsilon} n)$, and $t_{extract} = O(\ell/\log_{\sigma} n)$.


5 Dynamic Graphs and Binary Relations 

Let $R$ denote a binary relation between $t$ objects and $\sigma_l$ labels. In this section we denote by $n$ the
cardinality of $R$, i.e., the number of object-label pairs. We will assume that objects and labels are
integers from intervals $[1, t]$ and $[1, \sigma_l]$ respectively. Barbay et al. [4] showed how a static relation
$R$ can be represented by a string $S$. A dynamization of their approach based on dynamic data
structures for rank and select queries is described in [35].

Let $M$ be a matrix that represents a binary relation $R$; columns of $M$ correspond to objects and
rows correspond to labels. The string $S$ is obtained by traversing $M$ columnwise (i.e., objectwise)
and writing the labels. An additional bit string $N$ encodes the numbers of labels related to objects:
$N = 1^{n_1} 0\, 1^{n_2} 0 \ldots 1^{n_t}$, where $n_i$ is the number of labels related to the $i$-th object. Using rank, select,
and access queries on $N$ and $S$, we can enumerate objects related to a label, enumerate labels
related to an object, and decide whether an object and a label are related.
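
The encoding by $S$ and $N$ can be made concrete with a small Python sketch. It uses naive linear-time `rank` and `select` in place of the succinct structures of the paper; all function names and the convention that `select` returns the sequence length when fewer than $k$ occurrences exist are assumptions of this illustration.

```python
def encode_relation(pairs, t):
    """Build the label string S and the bit string N = 1^{n_1} 0 ... 1^{n_t}."""
    by_object = [[] for _ in range(t + 1)]
    for obj, label in sorted(set(pairs)):
        by_object[obj].append(label)
    S, N = [], []
    for obj in range(1, t + 1):
        S.extend(by_object[obj])
        N.extend([1] * len(by_object[obj]))
        if obj < t:
            N.append(0)
    return S, N


def rank(seq, symbol, pos):
    """Number of occurrences of symbol among the first pos elements."""
    return sum(1 for x in seq[:pos] if x == symbol)


def select(seq, symbol, k):
    """1-based position of the k-th symbol; 0 if k == 0; len(seq) if absent."""
    if k == 0:
        return 0
    seen = 0
    for idx, x in enumerate(seq, start=1):
        if x == symbol:
            seen += 1
            if seen == k:
                return idx
    return len(seq)


def object_range(N, i):
    """Return (l, r) such that the labels of object i are S[l+1..r] (1-based)."""
    return rank(N, 1, select(N, 0, i - 1)), rank(N, 1, select(N, 0, i))


def labels_of(S, N, i):
    l, r = object_range(N, i)
    return S[l:r]


def objects_of(S, N, a):
    result = []
    for k in range(1, rank(S, a, len(S)) + 1):
        pos = select(S, a, k)               # position of the k-th a in S
        # the object is one more than the number of 0s before the pos-th 1
        result.append(rank(N, 0, select(N, 1, pos)) + 1)
    return result


def related(S, N, i, a):
    l, r = object_range(N, i)
    return rank(S, a, r) - rank(S, a, l) > 0
```

For the relation {(1,2), (1,3), (2,1), (3,2), (3,3)} with $t = 3$, the sketch produces `S = [2, 3, 1, 2, 3]` and `N = [1, 1, 0, 1, 0, 1, 1]`.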

Deletion-Only Data Structure We keep $R$ in $S$ and $N$ described above; $S$ and $N$ are stored
in static data structures. If a pair $(e, l)$ is deleted from $R$, we find the element of $S$ that encodes
this pair and mark it as deleted. We record marked elements (i.e. pairs that are deleted but are
still stored in the data structure) in a bit vector $D$: $D[i] = 0$ if and only if the pair $S[i]$ is marked as
deleted. We maintain the data structure of Lemma 3 on $D$. Moreover we keep $D$ in a data structure
described in [20]; this data structure enables us to count the number of 1-bits in any range of $D$.
For each label $a$ we also keep a data structure $D_a$. $D_a$ is obtained by traversing the $a$-th row of $M$:
if $M[a,j] \ne 0$, then we append 0 to $D_a$ if $(a, j)$ is marked as deleted; if $M[a,j] \ne 0$ and $(a, j)$ is not
marked as deleted, we append 1 to $D_a$. For each $D_a$ we also maintain data structures for reporting
and counting 1-bits described above. Finally we record indices of deleted labels and objects in two
further bit sequences. The static data structures on $S$ and $N$ are implemented as in [2], so that
rank and select queries are answered in $O(\log\log\sigma_l)$ time and any $S[i]$ or $N[i]$ can be retrieved in
constant time.

If we need to list labels related to an object $i$, we first find the part of $S$ that contains these
labels. Let $l = rank_1(select_0(i - 1, N), N)$ and $r = rank_1(select_0(i, N), N)$. We list all elements
of $S[l..r]$ that are not marked as deleted by enumerating all 1-bits in $D[l..r]$. Then we access and
report $S[i_1], S[i_2], \ldots, S[i_f]$, where $i_1, i_2, \ldots, i_f$ are positions of 1-bits in $D[l..r]$. In order to
list objects related to a label $a$, we find positions of 1-bits in $D_a$. Then we access and report
$select_a(j_1, S)$, $select_a(j_2, S)$, $\ldots$, where $j_1, j_2, \ldots$ denote positions of 1-bits in $D_a$. In order to
determine whether an object $i$ and a label $a$ are related, we compute $d = rank_a(r, S) - rank_a(l, S)$,
where $l$ and $r$ are as defined above. If $d = 0$, then the object $i$ and the label $a$ are not related. If
$d = 1$, we compute $j = select_a(rank_a(r, S), S)$; $i$ and $a$ are related if and only if $D[j] = 1$.

When $(e, l)$ is deleted, we find the position $j$ of $(e, l)$ in $S$ and set $D[j] = 0$; $j$ can be found with
a constant number of rank and select queries. We also set $D_a[j'] = 0$ for $j' = rank_a(j, S)$, where $a$
is the label of the deleted pair. When an
empty label or an empty object is removed, we simply record this fact by adding it to a compact
list of empty labels (resp. empty objects). When the number of pairs that are marked as deleted
exceeds $n/\tau$, we start the process of re-building the data structure. The cost of re-building is
distributed among the following updates; we will give a more detailed description in the exposition
of the fully-dynamic data structure.
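
The lazy-deletion idea can be sketched in a few lines of Python. This is a deliberately simplified model: pairs are kept in a sorted list with a dictionary lookup instead of the static rank/select structures on $S$ and $N$, and the class name `DeletionOnlyRelation` and its method names are hypothetical. Only the role of the bit vector $D$ and the $n/\tau$ re-build threshold are taken from the text.

```python
class DeletionOnlyRelation:
    """Static set of object-label pairs with lazy deletions."""

    def __init__(self, pairs, tau):
        self.pairs = sorted(set(pairs))      # object-wise order, like S
        self.position = {p: k for k, p in enumerate(self.pairs)}
        self.D = [1] * len(self.pairs)       # D[k] == 0 iff pair k is deleted
        self.deleted = 0
        self.tau = tau

    def delete(self, obj, label):
        """Mark a pair as deleted; return False if it is absent or deleted."""
        k = self.position.get((obj, label))
        if k is None or self.D[k] == 0:
            return False
        self.D[k] = 0
        self.deleted += 1
        return True

    def needs_rebuild(self):
        """True once more than n / tau pairs are marked as deleted."""
        return self.deleted * self.tau > len(self.pairs)

    def labels_of(self, obj):
        return [l for k, (o, l) in enumerate(self.pairs)
                if o == obj and self.D[k] == 1]

    def related(self, obj, label):
        k = self.position.get((obj, label))
        return k is not None and self.D[k] == 1
```

A caller would poll `needs_rebuild()` after each deletion and, once it returns true, start constructing a fresh structure from the undeleted pairs while the old one keeps answering queries.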

Fully-Dynamic Data Structure We split a binary relation $R$, regarded as a set of object-label
pairs, into subsets and keep these subsets in data structures $C_0, C_1, \ldots, C_r$, $L_1, \ldots, L_r$, and $T_1,
\ldots, T_g$ for $g = \Theta(\tau)$. We set the parameter $\tau = \log\log n$. Only $C_0$ is stored in a fully-dynamic
data structure, but we can afford to keep $C_0$ in $O(\log n)$ bits per item because it contains only a
small fraction of pairs. All other pairs are stored in deletion-only data structures described above.
Distribution of pairs among subsets and procedures for re-building deletion-only data structures
are the same as in Section 3. To simplify the description, we will not distinguish between a subset
and a data structure that stores it.

$C_0$ contains at most $\mathrm{max}_0 = 2n/\log^2 n$ pairs. Each structure $C_i$ for $r \ge i \ge 1$ contains at most
$\mathrm{max}_i = 2n/\log^{2 - i\varepsilon} n$ pairs. Every $T_i$ contains at most $2n/\tau$ pairs. Data structure $C_0$ contains
object-label pairs in uncompressed form and uses $O(\log n)$ bits per pair. For every object $i$ that
occurs in $C_0$ we keep a list $L_i$ that contains all labels that occur in pairs $(i, \cdot) \in C_0$; for each label
$a$ that occurs in $C_0$ we keep a list of objects that occur in pairs $(\cdot, a) \in C_0$. Using these lists we
can enumerate all objects related to a label or labels related to an object in $C_0$ in $O(1)$ time per
datum. If we augment lists $L_i$ with predecessor data structures described in [1], we can also find
out whether an object $i$ and a label $a$ are related in $O((\log\log\sigma_l)^2)$ time.

All pairs in $C_1, L_1, \ldots, C_r, L_r$, and $T_1, \ldots, T_g$ are kept in deletion-only data structures described
above. A new object-label pair $(i, a)$ is inserted into $C_0$ if $C_0$ contains fewer than $\mathrm{max}_0$ pairs.
Otherwise we look for the smallest $j$, $0 \le j < r$, such that $|C_{j+1}| + |C_j| + 1 \le \mathrm{max}_{j+1}$. We lock
$C_j$ by setting $L_j = C_j$, $C_j = \emptyset$ and initiate the process of creating $N_{j+1} = L_j \cup C_{j+1} \cup \{(i, a)\}$.
If $|C_{i+1}| + |C_i| + 1 > \mathrm{max}_{i+1}$ for all $i < r$, we lock $C_r$ and start the process of constructing
$N_{r+1} = L_r \cup \{(i, a)\}$. The cost of creating $N_j$ is distributed among the next $\mathrm{max}_j$ updates in the
same way as in Section 3. We observe that data structures $Temp_i$ are not needed now because each
update inserts only one element (pair) into the relation $R$. We guarantee that each structure $C_i$ for
$1 \le i \le r$ contains at most $\mathrm{max}_i/2$ pairs marked as deleted and each $T_i$ for $1 \le i \le g$ contains an
$O(1/\tau)$ fraction of deleted pairs. Procedures for re-building data structures that contain too many
pairs marked as deleted are the same as in Section 3.

Our fully-dynamic data structure must support insertions and deletions of new objects and
labels. An object that is not related to any label or a label that is not related to any object can be
removed from a data structure. This means that both the number of labels $\sigma_l$ and the number of
objects $t$ can change dynamically. Removing and inserting labels implies changing the alphabets of
strings $S$ that are used in deletion-only data structures. Following [35] we store two global tables,
$NS$ and $SN$; $SN$ maps labels to integers bounded by $O(\sigma_l)$ (global label alphabet) and $NS$ maps
integers back to labels. We also keep bitmaps $GC_i$, $GT_i$, $GL_i$, and $GN_i$ for all subsets $C_i$,
$L_i$, $N_i$, and $T_i$. $GC_i[j] = 1$ if the label that is assigned to integer $j$ occurs in $C_i$ and $GC_i[j] = 0$
otherwise; $GT_i$, $GL_i$, and $GN_i$ keep the same information for subsets $T_i$, $L_i$, and $N_i$. Using these
bit sequences we can map the symbol of a label in the global alphabet to the symbol of the same
label in the effective alphabet⁴ used in one of the subsets. When a label $a$ is deleted, we mark $SN[a]$
as free. When a new label $a'$ is inserted, we set $SN[a']$ to a free slot in $SN$ (a list of free slots is
maintained). When some subset, say $C_i$, is re-built, we also re-build the bit sequence $GC_i$.

In order to list objects related to a label $a$, we first report all objects that are related to
$SN[a]$ and stored in $C_0$. Then we visit all subsets $C_i$, $L_i$, and $T_i$ and report all objects related
to $rank_1(SN[a], GC_i)$, $rank_1(SN[a], GL_i)$, and $rank_1(SN[a], GT_i)$ respectively. We remark that a
global symbol of a label can be mapped to a wrong symbol in the local effective alphabet. This
can happen if some label $a'$ is removed and its slot in $SN[]$ is assigned to another label $a$ but the
bitmap of, say, $GC_i$ is not yet re-built. In this case $rank_1(SN[a], GC_i)$ will map $a$ to the symbol for
the wrong label $a'$. But $a'$ can be removed only if all object-label pairs containing $a'$ are deleted;
hence, all pairs $(i, a')$ in $C_i$ are marked as deleted and the query to $C_i$ will correctly report nothing.
We can report labels related to an object and tell whether a certain object is related to a certain
label using a similar procedure. We visit $O(\log\log n)$ data structures in order to answer a query. In
all data structures except for $C_0$, we spend $O(\log\log\sigma_l)$ time per reported datum. An existential
query on $C_0$ takes $O((\log\log\sigma_l)^2)$ time; all other queries on $C_0$ take $O(1)$ time per reported datum.
Hence all queries are answered in $O(\log\log n\log\log\sigma_l)$ time per reported datum. A counting query
takes $O(\log n/\log\log n)$ time in each subset. Hence, we can count objects related to a label or
labels related to an object in $O(\log n)$ time.
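
The translation from the global label alphabet to the effective alphabet of one subset is a single rank query on that subset's bitmap. The Python sketch below illustrates it with a naive `rank1`; the function names, the sample table `SN`, and the choice to return `None` when the bitmap has a 0 at the label's slot are assumptions made for this example.

```python
def rank1(bitmap, j):
    """Number of 1-bits among bitmap[1..j] (positions are 1-based)."""
    return sum(bitmap[:j])


def local_symbol(bitmap, global_symbol):
    """Map a global label symbol to its symbol in a subset's effective alphabet.

    bitmap plays the role of GC_i (or GL_i, GT_i): bit j is 1 iff the label
    assigned to integer j occurs in the subset. Returning None for labels
    that do not occur is a convention of this sketch.
    """
    if bitmap[global_symbol - 1] == 0:
        return None
    return rank1(bitmap, global_symbol)


# Hypothetical example: global slots 1..6, of which 2, 3 and 5 occur in C_i.
SN = {"x": 2, "y": 3, "z": 5, "w": 4}
GC_i = [0, 1, 1, 0, 1, 0]
```

With these sample tables, label `"z"` has global symbol 5 and local symbol 3 in the subset, while `"w"` does not occur there at all.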

All bit sequences $D$ and $D_a$ in all subsets use $O((n/\tau)\log\tau)$ bits. Every string $S$ stored in a
deletion-only data structure needs $|S|H_0(S) + o(|S|\log\sigma_l)$ bits. Hence all strings $S$ use at most
$nH + o(n\log\sigma_l)$ bits, where $H = \sum_{1 \le a \le \sigma_l} \frac{n_a}{n}\log\frac{n}{n_a}$. Bit sequences $GC_i$, $GL_i$, and $GT_i$ use $O(\sigma_l\tau) =
o(n\log\sigma_l)$ bits. Now we consider the space usage of bit sequences $N$ stored in deletion-only data
structures. Let $m_i$ denote the number of pairs in a data structure $T_i$. $N$ consists of $m_i$ 1's and $t$ 0's.
If $m_i > t$, then the bit sequence $N$ stored as a part of $T_i$ uses $m_i\log\frac{m_i + t}{m_i} = O(m_i)$ bits. If $t > m_i$,
$N$ uses $O(m_i\log\tau)$ bits because $m_i = O(n/\tau)$. Hence all $N$ stored in all $T_i$ use $O(n\log\tau)$ bits.
In our data structure we set $\tau = \log\log n$. If $\sigma_l = \Omega(\log^{1/4} n)$, $O(n\log\tau) = o(n\log\sigma_l)$. Otherwise
$t = \Omega(n/\log n)$ because $n \le t \cdot \sigma_l$; if $t = \Omega(n/\log n)$, $O(n\log\tau) = o(t\log t)$. Data structures that
are re-built at any moment of time contain $O(n/\tau)$ elements and use $O(\frac{n}{\tau}\log\sigma_l) = o(n\log\sigma_l)$ bits.
Extra space that we need to store elements marked as deleted is bounded by $o(n\log\sigma_l)$; this can
be shown in the same way as in Section 3.

4 An effective alphabet of a sequence $S$ contains only symbols that occur in $S$ at least once.

Theorem 2 A dynamic binary relation that consists of $n$ pairs relating $t$ objects to $\sigma_l$ labels can
be stored in $nH + o(n\log\sigma_l) + o(t\log t) + O(t + n + \sigma_l\log n)$ bits where $H = \sum_{1 \le a \le \sigma_l} \frac{n_a}{n}\log\frac{n}{n_a}$ and
$n_a$ is the number of objects related to a label $a$. We can determine whether an object and a label are
related in $O(\log\log\sigma_l\log\log n)$ time and report all objects related to a label (resp. all labels related
to an object) in $O((k + 1)\log\log\sigma_l\log\log n)$ time, where $k$ is the number of reported items. We can
count objects related to a label or labels related to an object in $O(\log n)$ time. Updates are supported
in $O(\log^{\varepsilon} n)$ time.

A directed graph is a frequently studied instance of a binary relation. In this case both the set
of labels and the set of objects are identical with the set of graph nodes. There is an edge from a
node $u$ to a node $v$ if the object $u$ is related to the label $v$.

Theorem 3 A dynamic directed graph that consists of $\sigma_l$ nodes and $n \ge \sigma_l$ edges can be stored
in $nH + o(n\log\sigma_l) + O(n + \sigma_l\log n)$ bits where $H = \sum_{1 \le a \le \sigma_l} \frac{n_a}{n}\log\frac{n}{n_a}$ and $n_a$ is the number
of outgoing edges from node $a$. We can determine if there is an edge from one node to another
one in $O(\log\log\sigma_l\log\log n)$ time and report all neighbors (resp. reverse neighbors) of a node in
$O((k + 1)\log\log\sigma_l\log\log n)$ time, where $k$ is the number of reported nodes. We can count neighbors
or reverse neighbors of a node in $O(\log n)$ time. Updates are supported in $O(\log^{\varepsilon} n)$ time.

6 Conclusions 

In this paper we described a general framework for transforming static compressed indexes into 
dynamic ones. We showed that, using our framework, we can achieve the same or almost the same 
space and time complexity for dynamic indexes as was previously obtained by static indexes. Our 
framework is applicable to a broad range of static indexes that includes a vast majority of currently 
known results in this area. Thus, using our techniques, we can easily modify almost any compressed 
static index, so that insertions and deletions of documents are supported. It will likely be possible 
to apply our framework to static indexes that will be obtained in the future. Our approach also 
significantly reduces the cost of basic queries in compact representations of dynamic graphs and 
binary relations. We expect that our ideas can be applied to the design of other compressed data 
structures. 

Acknowledgments The authors wish to thank Djamal Belazzougui for clarifying the construc¬

A.1 Reporting 1-Bits in a Bit Vector


We show how to store a bit vector with a small number of zeros in small space, so that all 1-values
in an arbitrary range can be reported in optimal time. This result is used by our method that
transforms a static index into an index that supports deletions. We start by describing an $O(n)$-bit
data structure. Then we show how space usage can be reduced to $O((n\log\tau)/\tau)$.

Lemma 2 There exists an $O(n)$-bit data structure that supports the following operations on a bit
vector $B$ of size $n$: (i) $zero(i)$ sets $B[i] = 0$; (ii) $report(s, e)$ enumerates all $j$ such that $s \le j \le e$
and $B[j] = 1$. Operation $zero(i)$ is supported in $O(\log^{\varepsilon} n)$ time and a query $report(s, e)$ is answered
in $O(k)$ time, where $k$ is the number of output bit positions.

Proof: We divide the vector $B$ into words $W_1, \ldots, W_{\lceil |B|/\log n\rceil}$ of $\log n$ bits. We say
that a word $W_t$ is non-empty if at least one bit in $W_t$ is set to 1. We store the indices of all
non-empty words in a data structure that supports range reporting queries in $O(k)$ time, where $k$ is
the number of reported elements, and updates in $O(\log^{\varepsilon} n)$ time [33]. For every word $W_t$ we can
find the first bit set to 1 to the right of a given position $p$, or determine that there is no bit
set to 1 to the right of $p$, in $O(1)$ time. This can be done by consulting a universal look-up table
of size $o(n)$ bits. To report positions of all 1-bits in $B[s..e]$, we find all non-empty words whose
indices are in the range $[\lceil s/\log n\rceil, \lfloor e/\log n\rfloor]$. For every such word, we output the
positions of all 1-bits. Finally, we also examine the words $W_{\lfloor s/\log n\rfloor}$ and
$W_{\lceil e/\log n\rceil}$ and report positions of 1-bits in these two words that are in $B[s..e]$. The
total query time is $O(k)$. Operation $zero(i)$ is implemented by setting the bit
$i - \lfloor i/\log n\rfloor\log n$ in the word $W_{\lceil i/\log n\rceil}$ to 0. If $W_{\lceil i/\log n\rceil}$
becomes empty, we remove $\lceil i/\log n\rceil$ from the range reporting data structure. □
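The construction in the proof of Lemma 2 can be illustrated with a minimal sketch. It is not the structure of the lemma itself: a sorted Python list searched with `bisect` stands in for the range reporting structure of [33], so updates here take linear time, and bit tricks on Python integers replace the universal look-up table. Positions are 0-based, and the class name and the `word_bits` parameter are assumptions made for illustration.

```python
import bisect


class ReportableBitVector:
    """Bit vector, initially all 1s, supporting zero(i) and report(s, e)."""

    def __init__(self, n, word_bits=8):
        self.n = n
        self.w = word_bits
        num_words = (n + word_bits - 1) // word_bits
        # Bit k of word t corresponds to position t * w + k of B.
        self.words = []
        for t in range(num_words):
            bits_in_word = min(word_bits, n - t * word_bits)
            self.words.append((1 << bits_in_word) - 1)
        # Sorted indices of non-empty words (stand-in for the structure of [33]).
        self.nonempty = list(range(num_words))

    def zero(self, i):
        """Set B[i] = 0 and drop the word from the index if it became empty."""
        t, k = divmod(i, self.w)
        self.words[t] &= ~(1 << k)
        if self.words[t] == 0:
            pos = bisect.bisect_left(self.nonempty, t)
            if pos < len(self.nonempty) and self.nonempty[pos] == t:
                del self.nonempty[pos]

    def report(self, s, e):
        """Return all j with s <= j <= e and B[j] = 1, in increasing order."""
        out = []
        lo = bisect.bisect_left(self.nonempty, s // self.w)
        hi = bisect.bisect_right(self.nonempty, e // self.w)
        for t in self.nonempty[lo:hi]:
            word = self.words[t]
            while word:
                low = word & -word  # isolate the lowest set bit
                j = t * self.w + low.bit_length() - 1
                if s <= j <= e:  # only the two boundary words can fail this
                    out.append(j)
                word ^= low
        return out
```

Only the two boundary words can contain 1-bits outside the query range, so the work spent on non-reported bits is bounded by two words, mirroring the argument in the proof.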

Lemma 3 Let $B$ be a bit vector of size $n$ with at most $O(n/r)$ zero values for
$r = O(\log n/\log\log n)$. $B$ can be stored in an $O(n\frac{\log r}{r})$-bit data structure that supports
the following operations on $B$: (i) $zero(i)$ sets $B[i] = 0$; (ii) $report(s, e)$ enumerates all $j$ such
that $s \le j \le e$ and $B[j] = 1$. Operation $zero(i)$ is supported in $O(\log^{\varepsilon} n)$ time and a query
$report(s, e)$ is answered in $O(k)$ time, where $k$ is the number of output bit positions.

Proof: We divide $B$ into words $W_i$ of $r$ bits. Indices of non-empty words are stored in the data
structure $B'$, implemented as in Lemma 2. Every word $W_i$ is represented as follows: we store the
number of zeros in $W_i$ using $O(\log r)$ bits. A word with $f$ zeros, where $0 \le f \le r$, is encoded
using $f\log r$ bits by specifying the positions of its 0-bits. For every word $W_i$ we can find the first
bit set to 1 to the right of a given position $p$, or determine that there is no bit set to 1 to the
right of $p$, in $O(1)$ time. This can be done by consulting a universal look-up table of size $o(n)$
bits. Query processing is very similar to Lemma 2. To report 1-bits in $B[s..e]$, we find all non-empty
words whose indices are in the range $[\lceil s/r\rceil, \lfloor e/r\rfloor]$. For every such word, we output the
positions of all 1-bits. Finally, we also examine the words $W_{\lfloor s/r\rfloor}$ and $W_{\lceil e/r\rceil}$ and
report positions of 1-bits in these two words that are in $B[s..e]$.

Operation $zero(i)$ is implemented by setting the corresponding bit in some word $W_j$ to 0 and
changing the word encoding. If $W_j$ becomes empty, the $j$-th bit in $B'$ is set to 0. Issues related to
memory management can be resolved as in [35].

We need $O(n/r)$ bits to store the data structure $B'$ for non-empty words. Let $n_f$ denote the
number of words with $f$ zero values. All words $W_i$ need
$\sum_{f=1}^{r} n_f\cdot f\cdot\log r = \log r\sum_{f=1}^{r} n_f\cdot f = O((n/r)\log r)$ bits
because $\sum_f n_f\cdot f = O(n/r)$. □
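The word encoding used in the proof of Lemma 3 can be sketched as follows. This is an illustration under simplifying assumptions, not the data structure of the lemma: each word is kept as a Python list of zero offsets, the index $B'$ of non-empty words is omitted (so `report` scans every word in the range), and the size function only counts the bits charged in the space analysis. All function names are hypothetical.

```python
import math


def encode_words(bits, r):
    """Split bits into r-bit words; keep each word as (length, zero offsets)."""
    words = []
    for start in range(0, len(bits), r):
        chunk = bits[start:start + r]
        zeros = [k for k, b in enumerate(chunk) if b == 0]
        words.append((len(chunk), zeros))
    return words


def encoded_size_bits(words, r):
    """Bits charged in the analysis: a zero counter per word, log r per zero."""
    log_r = max(1, math.ceil(math.log2(r)))
    counter_bits = max(1, math.ceil(math.log2(r + 1)))
    return sum(counter_bits + len(zeros) * log_r for _, zeros in words)


def report(words, r, s, e):
    """Return all positions j with s <= j <= e whose bit is 1."""
    out = []
    for t in range(s // r, e // r + 1):
        length, zeros = words[t]
        zero_set = set(zeros)
        for k in range(length):
            j = t * r + k
            if s <= j <= e and k not in zero_set:
                out.append(j)
    return out
```

The size function makes the two terms of the space bound visible: the per-word counters contribute $O((n/r)\log r)$ bits, and the stored offsets contribute $\log r$ bits per zero.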
A.2 Dynamic Document Collection in $O(n\log n)$ bits

A generalized suffix tree is a compact trie that contains all suffixes of all documents. Trie edges
are labeled with strings and leaves correspond to suffixes of documents. Each document ends with
a unique special symbol $\$_j$, hence all suffixes are unique. Every internal node has at least two
children. Let $path(v)$ denote the string obtained by concatenating labels on the path from the root
to a node $v$. The locus of a string $P$ is the highest node $v$ such that $P$ is a prefix of $path(v)$. For
every leaf $u$, $path(u)$ corresponds to a suffix. Each occurrence of $P$ corresponds to a unique leaf
that descends from the locus node of $P$. See, e.g., [23] for a more detailed description of suffix trees.

We keep the collection $C_0$ in a generalized suffix tree (GST) augmented with suffix links. A
suffix link for a node $u$ labelled with a string $aX$ points to a node $v$ labelled with the string $X$. We
use the algorithm of McCreight for inserting a new string into a suffix tree. When a new text $T$ is
inserted, we find the position of the string $T$ in the GST. Then we insert a leaf $u_1$ labelled with the
suffix $T[1..|T|]$; if necessary, we also insert a parent node of $u_1$ into the GST. Then we follow the
suffix link in the lowest “old” ancestor of $u_1$ (i.e., the lowest node on the path to $u_1$ that existed
before the insertion of $T$ started). If this link points to some node $v$, we descend from $v$ as far as
possible. Then we insert a new leaf $u_2$ corresponding to $T[2..|T|]$ and possibly the parent of $u_2$. This
procedure continues until all suffixes of $T$ are inserted. Deletions are symmetric. The number of
traversed edges and inserted nodes is $O(|T|)$. Every insertion of a new node takes $O(1)$ time.

To navigate in the suffix tree, we need a data structure $D(u)$ in each internal node $u$. For every
child $u_i$ of $u$, $D(u)$ contains the first character of the label of the edge $l(u, u_i)$, where $l(v, w)$
denotes the edge between nodes $v$ and $w$. For every alphabet symbol $a$, $D(u)$ returns a pointer to the
edge $l(u, u_i)$ whose label starts with $a$ or reports that no such edge exists. We can implement $D(u)$
in such a way that queries and updates take $O(1)$ time. If the alphabet size $\sigma$ is poly-logarithmic
in $n$, we can use the data structure of Fredman and Willard [16]. If the alphabet size is large,
$\sigma = \log^{\omega(1)} n$, we use dynamic hashing to keep all children of a node $u$. In the latter case,
the update time is randomized. If the alphabet size is large and updates must use a deterministic
algorithm, then we implement $D(u)$ as an exponential tree [1]; in this case an appropriate child $u_i$
of $u$ is found in $O((\log\log\sigma)^2)$ time.

Occurrences of a pattern $P$ are reported using the standard suffix tree procedure. We traverse
the search path for $P$ starting at the root node and choosing the child $u_i$ of the current
node $u$ that is labelled with a prefix of $P$ until the locus of $P$ is found or the search cannot continue.
In each visited node $u$ we search for $p_i$ in $D(u)$, where $p_i$ is the next unprocessed symbol in $P$. If
$u_i$ is labelled with a prefix of $P$, the search continues in $u_i$. Otherwise the search ends on the edge
from $u$ to $u_i$. When the locus of a pattern $P$ is found, we can report all occurrences of $P$ in $O(1)$
time per occurrence.
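The insert, delete, and search operations on the uncompressed collection can be illustrated with a deliberately naive sketch. It is not McCreight's algorithm: it uses an uncompacted trie with one node per character, no suffix links, and an explicit occurrence list in every node, so insertion takes quadratic time and space. A Python dictionary plays the role of $D(u)$, and document terminators are left out because occurrences are tagged with a document identifier. The class and method names are assumptions made for illustration.

```python
class Node:
    def __init__(self):
        self.children = {}  # next character -> child; plays the role of D(u)
        self.occ = []       # (doc_id, start) of every suffix passing through


class GeneralizedSuffixTrie:
    def __init__(self):
        self.root = Node()
        self.docs = {}

    def insert(self, doc_id, text):
        """Insert every suffix of text, one character at a time."""
        self.docs[doc_id] = text
        for start in range(len(text)):
            node = self.root
            for ch in text[start:]:
                node = node.children.setdefault(ch, Node())
                node.occ.append((doc_id, start))

    def delete(self, doc_id):
        """Remove every suffix of the document, pruning emptied subtrees."""
        text = self.docs.pop(doc_id)
        for start in range(len(text)):
            node = self.root
            for ch in text[start:]:
                child = node.children[ch]
                child.occ.remove((doc_id, start))
                if not child.occ:
                    # No other suffix passes through child, so drop its subtree.
                    del node.children[ch]
                    break
                node = child

    def locate(self, pattern):
        """Walk down from the root; return sorted (doc_id, start) pairs."""
        node = self.root
        for ch in pattern:
            node = node.children.get(ch)
            if node is None:
                return []
        return sorted(node.occ)
```

In the structure described in the text, the occurrence lists are implicit: the occurrences are the leaves below the locus, which is what gives $O(1)$ time per reported occurrence without the quadratic space used here.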

We can also modify our data structure so that the locus of $P$ is found in
$O((|P|/\log_\sigma n)(\log\log\sigma)^2 + \log n)$ time [36]. If the update procedure uses randomization,
then the locus of $P$ can be found in $O(|P|/\log_\sigma n + \log^{\varepsilon} n)$ time.

A.3 Maintaining the Sizes of Sub-Collections after Updates in Transformation 2

We show here how to maintain the invariant $n_f = \Theta(n)$. If $n > 2n_f$ after a document insertion, we
set $n_f = n$. Maximal sizes $\max_i$ of sub-collections $C_i$ are changed accordingly. All top sub-collections
$\mathcal{T}_i$ that contain fewer than $n_f/r$ symbols are merged into new collections $\mathcal{T}'_i$ of total
size between $n_f/r$ and $2n_f/r$ symbols. During the next $n_f/r$ symbol updates (that is, insertions and
deletions of texts of total size $n_f/r$), we construct new collections $\mathcal{T}'_i$.

Collections $\mathcal{T}_i$ that must be re-built are processed one by one. Since at any moment only one
$\mathcal{T}'_i$ is being constructed, this process needs $O(n\,w(n)/r)$ bits of workspace.

If $n < n_f/2$ after a document deletion, we set $n_f = n/2$. All $\mathcal{T}_i$ that contain more than one
document and satisfy $|\mathcal{T}_i| > n_f/r$ are split into two sub-collections $\mathcal{T}'_i$. Each document
$T$ from $\mathcal{T}_i$ such that $|T| > n_f/r$ is assigned to its own one-document collection $\mathcal{T}'_i$.
Other documents are assigned to collections of size between $n_f/r$ and $n_f/2r$ symbols. We also move all
documents from collections $C_j$, $j = 0, \ldots, r$, to one or two new collections $\mathcal{T}'_i$, each of
which contains between $n_f/r$ and $n_f/2r$ symbols. At any moment only one new collection $\mathcal{T}'_i$ is
being constructed. Hence this process also needs $O(n\,w(n)/r)$ bits of workspace. We can schedule the
rebuilding in such a way that all $\mathcal{T}'_i$ are finished after the following $n_f$ symbol updates.

We also start the re-building process every time a one-document collection $\mathcal{T}_i$ is inserted
or deleted. In this case we update the value of $n_f$ and re-build the sub-collections as described
above (if another process for replacing $\mathcal{T}_i$ with $\mathcal{T}'_i$ is currently running in the
background, then this process is terminated). Since $\mathcal{T}_i$ contains (resp. contained) a document $T$
of size $\Omega(n/r)$, re-building sub-collections takes $O(|T| r\cdot u(n))$ time.⁵ Hence, $n_f = \Theta(n)$ at any time.
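The two threshold rules for the size estimate can be checked with a small simulation. The sketch below only tracks the estimate, written `n_f` here, under the rules as stated above: reset it to $n$ when $n$ exceeds twice the estimate, and to $n/2$ when $n$ falls below half of it. The merging, splitting, and background rebuilding of sub-collections are not modelled, and the function names are hypothetical.

```python
def update_fixed_size(n, n_f):
    """Return the new estimate n_f after the collection size changed to n."""
    if n > 2 * n_f:   # collection grew too much after an insertion
        return n
    if n < n_f / 2:   # collection shrank too much after a deletion
        return n / 2
    return n_f


def simulate(sizes, n_f):
    """Apply the update rule to a sequence of sizes; return (n, n_f) pairs."""
    history = []
    for n in sizes:
        n_f = update_fixed_size(n, n_f)
        history.append((n, n_f))
    return history
```

After every step the pair satisfies $n_f/2 \le n \le 2n_f$, which is the sense in which the estimate stays within a constant factor of the true size.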


A.4 Dynamic Transformation with Lower Update Cost 

Transformation 3 Suppose that there exists a static $(u(n), w(n))$-constructible index $\mathcal{I}_s$ that
uses $|S|\phi(S)$ space for any document collection $S$. Then there exists a dynamic index $\mathcal{I}_d$ that
uses $|S|\phi(S) + O(|S|(\frac{\log r + \log\sigma}{r} + w(n)))$ space for any parameter
$r = O(\log n/\log\log n)$; $\mathcal{I}_d$ supports insertions and deletions of documents in
$O(u(n)\log\log n)$ time per symbol and $O(u(n)\cdot r + t_{SA} + \log^{\varepsilon} n)$ time per symbol
respectively. Update times are amortized. The asymptotic cost of range-finding increases by a factor
$O(\log\log n)$; the costs of extracting and locating are the same in $\mathcal{I}_s$ and $\mathcal{I}_d$.

We divide the document collection $C$ into sub-collections $C_0, \ldots, C_r$ such that $|C_i| \le \max_i$ and
$\max_i = 2(n/\log^2 n)2^i$ for $i = 0, 1, \ldots, r$. Thus the number of sub-collections is $r = O(\log\log n)$.
All collections $C_i$ are organized, queried, and updated in exactly the same way as in Transformation 1.
Since we must query $O(\log\log n)$ sub-collections, the time to answer a range-finding query
grows by an $O(\log\log n)$ factor. Deletion time is the same as in Transformation 1 because the same
deletion-only indices for sub-collections are used. The analysis of insertion costs is similar to
Transformation 1. Between two global rebuilds every text is inserted into each sub-collection at most once.
When a sub-collection $C_i$ is re-built, we insert $\Omega(|C_i|)$ new symbols into $C_i$. Hence, re-building a
collection incurs an amortized cost of $O(u(n))$ on every new symbol in $C_i$. Thus the total amortized
cost of an insertion is $O(u(n)\log\log n)$.
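The claim that doubling capacities yield $O(\log\log n)$ sub-collections can be checked numerically. The sketch below assumes capacities of the form $\max_i = 2(n/\log^2 n)2^i$ with logarithms to base 2, and stops at the first capacity that can hold all $n$ symbols; both the exact form of the capacities and the function name are assumptions made for illustration.

```python
import math


def subcollection_capacities(n):
    """Capacities 2 * (n / log^2 n) * 2^i, up to the first one that is >= n."""
    base = 2 * n / (math.log2(n) ** 2)
    caps = []
    i = 0
    while True:
        caps.append(base * 2 ** i)
        if caps[-1] >= n:
            return caps
        i += 1
```

Since the capacity doubles at each level and starts a factor $\log^2 n$ below $n$, the loop ends after roughly $2\log\log n$ levels.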

5 We assume here that when a new document $T$ is inserted, $T$ is stored in uncompressed form. Hence, the
procedure that constructs a new one-document collection $\mathcal{T}_i$ can use $O(|\mathcal{T}_i|\log\sigma)$ bits of
space. Alternatively we can assume that very big documents are split into several parts of at most $n/r$
symbols and each part is kept in a separate $\mathcal{T}_i$.
A.5 Analysis of Space Usage


In this section we show that the space overhead caused by keeping deleted symbols is bounded.
Suppose that $n/r$ symbols from some documents are marked as deleted in a collection $C$. Let $\overline{C}$
denote the collection $C$ without deleted documents. In this section we consider the case when the
space usage of $C$ is bounded by $nH_k + o(n)$ for some $k \ge 1$.

A context $c_i$ is an arbitrary sequence of length $k$ over an alphabet $\Sigma$; for simplicity we identify a
context $c_i$ by its index $i$, where $i \in [1, \sigma^k]$. Let $f_{a,i}$ and $f'_{a,i}$ denote the number of times the
symbol $a$ occurs in the context $i$ in $C$ and $C'$ respectively. Let $n_i = \sum_a f_{a,i}$ and
$n'_i = \sum_a f'_{a,i}$. The $k$-th order empirical entropy of $C$ is defined as
$H_k = \frac{1}{n}\sum_{c_i \in \Sigma^k}\sum_{a \in \Sigma} f_{a,i}\log\frac{n_i}{f_{a,i}}$.

We need $F_1 = \sum_i\sum_a f'_{a,i}\log\frac{n_i}{f_{a,i}}$ bits to keep all deleted symbols. We express
$\log\frac{n_i}{f_{a,i}} = \log\frac{n_i}{n'_i} + \log\frac{n'_i}{f'_{a,i}} + \log\frac{f'_{a,i}}{f_{a,i}} \le \log\frac{n_i}{n'_i} + \log\frac{n'_i}{f'_{a,i}}$.
Furthermore $\sum_i\sum_a f'_{a,i}\log\frac{n'_i}{f'_{a,i}} \le \frac{n}{r}\log\sigma$. We can also show
that $\sum_i n'_i\log\frac{n_i}{n'_i} = o(n)$. All contexts $i$ are divided into three sets. Let $I_1$ contain all
context indices $i$ such that $n_i > n'_i\log^2 n$. For all $i \in I_2$, $n'_i\log^2 n \ge n_i > n'_i(\log\log n)^2$.
For all $i \in I_3$, $n'_i(\log\log n)^2 \ge n_i$. Since $\sum_i n_i = O(n)$,
$\sum_{i \in I_1} n'_i\log\frac{n_i}{n'_i} + \sum_{i \in I_2} n'_i\log\frac{n_i}{n'_i} = O(n)\left(\frac{1}{\log n} + \frac{1}{\log\log n}\right) = o(n)$.
Since $\sum_i n'_i = O(n/r)$, $\sum_{i \in I_3} n'_i\log\frac{n_i}{n'_i} = O\left(\frac{n\log^{(3)} n}{r}\right) = o(n)$ for
$r = \Omega(\log^{(3)} n)$. Hence $F_1 \le (n/r)\log\sigma + o(n)$.

The contexts of most symbols in $\overline{C}$ are the same as in $C$. Only the first $k \le \log_\sigma n/2$ symbols
in each document can change context (because the previous document was deleted). The total
number of such symbols is bounded by $p\cdot k$. These symbols are encoded in $O(p\log n) + o(n)$
bits. Contexts of the remaining symbols in $\overline{C}$ remain unchanged. The space consumed by other (not
deleted) symbols can still be slightly higher than optimal. Let $\overline{f}_{a,i} = f_{a,i} - f'_{a,i}$ and
$\overline{n}_i = \sum_a \overline{f}_{a,i} = n_i - n'_i$. For simplicity we ignore symbols that changed contexts. All
undeleted symbols use $E_1 = \sum_i\sum_a \overline{f}_{a,i}\log\frac{n_i}{f_{a,i}}$ bits. Optimal compression of the same
sequence would use $E_2 = \sum_i\sum_a \overline{f}_{a,i}\log\frac{\overline{n}_i}{\overline{f}_{a,i}}$ bits. Hence
$F_2 = E_1 - E_2 \le \sum_i\sum_a \overline{f}_{a,i}\log\frac{n_i}{\overline{n}_i} = \sum_i \overline{n}_i\log\frac{n_i}{\overline{n}_i} = O(n)$.
Thus the total additional space is $F_1 + F_2 = O(n\frac{\log\sigma}{r}) + o(n\log\sigma)$.
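The first step of the analysis bounds the cost of the deleted symbols by splitting each logarithm: for every context $i$ and symbol $a$, $\log(n_i/f_{a,i}) \le \log(n_i/n'_i) + \log(n'_i/f'_{a,i})$ whenever $f'_{a,i} \le f_{a,i}$. The sketch below checks the summed form of this inequality on explicit counts. The dictionary layout (context to symbol to count) and the function names are assumptions made for illustration, and logarithms are taken to base 2.

```python
import math


def deleted_symbol_cost(f, f_del):
    """Sum over contexts i and symbols a of f'_{a,i} * log2(n_i / f_{a,i})."""
    total = 0.0
    for ctx, counts in f.items():
        n_i = sum(counts.values())
        for a, d in f_del.get(ctx, {}).items():
            if d:
                total += d * math.log2(n_i / counts[a])
    return total


def upper_bound(f, f_del):
    """Sum of n'_i log2(n_i / n'_i) plus sum of f'_{a,i} log2(n'_i / f'_{a,i})."""
    total = 0.0
    for ctx, counts in f.items():
        n_i = sum(counts.values())
        dels = [d for d in f_del.get(ctx, {}).values() if d]
        n_del = sum(dels)
        if n_del == 0:
            continue
        total += n_del * math.log2(n_i / n_del)
        total += sum(d * math.log2(n_del / d) for d in dels)
    return total
```

The second term of the bound is the zero-order cost of the deleted symbols within their contexts, which is why it can be bounded by the number of deleted symbols times $\log\sigma$.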


A.6 Construction Times of Static Indexes 

Arbitrarily Large Alphabets It can be shown that the index of Belazzougui and Navarro [7]
is $(\log^{\varepsilon} n, \log\sigma)$-constructible. This index consists of three components. First, a BWT
transform is applied to the source text. Then a data structure of Barbay et al. [3] for the
BWT-transformed text is created; this data structure supports select queries in $O(1)$ time. Second, a
compressed suffix tree for the source text is created. Third, we keep w-links on the compressed suffix
tree. A w-link for an alphabet symbol $a$ points from a node $u$ that is labelled with a suffix $X$ to a
node or a position in the tree that is labelled with a suffix $aX$; if $aX$ does not occur, then the link
for $a$ and $u$ does not exist. W-links are implemented using a collection of monotone minimal perfect
hash functions (mmphf) [8].

The index from [7] can be constructed as follows. First, we construct a compressed suffix tree in
$O(n\log^{\varepsilon} n)$ time using $O(n\log\sigma)$ bits of extra space by employing the algorithm described in [25].
Then we traverse the tree and produce the mmphf in $O(n)$ randomized time. Next we obtain the BWT
transform of $T$; this step takes $O(n\log\sigma)$ extra bits and $O(n)$ time. Finally, we construct the data
structure from [3]. Our method for constructing the data structure is as follows. Let $T^b$ denote the
BWT-transformed sequence. We split $T^b$ into chunks $C_s$ such that each chunk but the last consists
of $\sigma^2$ symbols and the last chunk consists of at most $\sigma^2$ symbols. Then the data structure of [3] is
constructed for each chunk. The symbols of $C_s$ are distributed among $O(\log\sigma)$ groups $G_i$. Each
symbol in $G_i$ occurs at least $2^i$ and at most $2^{i+1}$ times for $i = 1, 2, \ldots, \log|C_s|$. This step takes
linear time and $O(\sigma\log\sigma)$ extra bits. Let $C_{s,i}$ denote the subsequence of $C_s$ induced by symbols
of $G_i$; let $C_s(G)$ denote the sequence that specifies the group index for every symbol of $C_s$. We
replace $C_s$ with $C_s(G)$ and the subsequences $C_{s,i}$. Data structures supporting rank and select queries
are stored for $C_s(G)$ and all $C_{s,i}$. The data structure for $C_s(G)$ is implemented as described in [14]; the
data structures for $C_{s,i}$ are implemented as described by Golynski et al. Both data structures
can be constructed in linear time using $o(|C_s|)$ additional bits. If select queries on each chunk
can be answered in $O(1)$ time, we can also answer select queries on $T^b$ using $O(n)$ additional bits.
The method is based on keeping a bit vector $B_a = 1^{j_1}01^{j_2}0\ldots1^{j_f}0$ for every symbol $a$, where $f$ is
the number of chunks and $j_i$ is the number of times $a$ occurs in the $i$-th chunk $C_i$. We create a
data structure that answers rank and select queries on $B_a$. Then we can identify the chunk $C_h$ that
contains the $l$-th occurrence of $a$ by answering a query $rank_0(select_1(l, B_a), B_a)$. Finally, we identify
the position of the $l$-th occurrence of $a$ by a query $select_a(l - l', C_h)$, where $l' = rank_1(h - 1, B_a)$. Data
structures for a chunk $C_i$ can be constructed in linear time using $O(|C_i|\log\sigma)$ bits of workspace.
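The reduction from a global select query to a per-chunk select query can be sketched as follows. The sketch uses plain Python lists and linear scans in place of constant-time rank/select structures, 0-based chunk indices, and a hypothetical `chunk_size` parameter instead of the chunk length used in the text. The offset $l'$ is computed here as the number of 1-bits that precede the 0 closing the previous chunk, which is one concrete way to realise that step; all function names are assumptions.

```python
def build_chunk_bitvectors(text, chunk_size):
    """Return the chunks and, per symbol a, the vector 1^{j_1} 0 1^{j_2} 0 ..."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    bitvecs = {a: [] for a in set(text)}
    for chunk in chunks:
        for a, bits in bitvecs.items():
            bits.extend([1] * chunk.count(a) + [0])
    return chunks, bitvecs


def select_item(seq, value, k):
    """0-based position of the k-th (1-based) occurrence of value in seq."""
    seen = 0
    for pos, item in enumerate(seq):
        if item == value:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k occurrences")


def rank_item(seq, value, pos):
    """Number of occurrences of value in seq[0..pos], inclusive."""
    return sum(1 for item in seq[:pos + 1] if item == value)


def select_symbol(chunks, bitvecs, chunk_size, a, l):
    """0-based text position of the l-th (1-based) occurrence of symbol a."""
    b_a = bitvecs[a]
    p = select_item(b_a, 1, l)   # position of the l-th 1 in B_a
    h = rank_item(b_a, 0, p)     # number of whole chunks before the hit
    # Occurrences of a in chunks 0..h-1: the 1s before the h-th 0 of B_a.
    l_prime = rank_item(b_a, 1, select_item(b_a, 0, h)) if h > 0 else 0
    offset = select_item(chunks[h], a, l - l_prime)
    return h * chunk_size + offset
```

With constant-time rank and select on each $B_a$ and constant-time select inside a chunk, every step above becomes $O(1)$, which is the point of the construction.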

Index of Barbay et al. [3] This index is a part of the data structure of Belazzougui and
Navarro [7]. Hence it is also $(\log^{\varepsilon} n, \log\sigma)$-constructible. Unlike the structure in [7], the index
described in [3] can be constructed by a deterministic algorithm.

$O(n\log\sigma)$-bit Index The index of Grossi and Vitter [22] is also $(\log^{\varepsilon} n, \log\sigma)$-constructible.
Their index consists of the compressed suffix array CSA and functions $\Psi_k(i) = SA^{-1}[SA[i] + k]$ for
$k = 1, \ldots, \log^{i\varepsilon} n, \ldots$ and $i = 0, 1, \ldots, (1/\varepsilon)$. Using the algorithm of Hon et al. [25], we can
construct CSA and $\Psi_k$ in $O(n\log\log\sigma)$ time using $O(n\log\sigma)$ bits. To speed up the range finding,
Grossi and Vitter store a series of suffix trees for subsequences of the suffix array. The top level tree is
a compressed trie over $s_1 = n/\log_\sigma n$ suffixes $SA[1], SA[1 + \log_\sigma n], \ldots$. On the next level, we
consider each subarray $SA_h = SA[(h - 1)\log_\sigma n + 1..h\log_\sigma n]$. We select every log^ 2 n-th suffix
from SAh and construct a suffix tree for this set of suffixes. On the next level, we consider subarrays 
of size log^ £ ^ 2 n, select every log^ 2 n-th suffix and construct a suffix tree for the resulting subset. 
This subdivision continues until the size of the subarray is equal to $\log^{\varepsilon} n$. These suffix trees can
be constructed in $O(n\log^{\varepsilon} n)$ time and $O(n\log\sigma)$ bits: the total number of leaves in all suffix trees
is $o(n)$ and a suffix tree for $m$ suffixes can be constructed in $O(m\log^{\varepsilon} n)$ time [25]. The search for a
range of the suffix array that corresponds to the query pattern is described in [22]. Thus the index
from [22] can be constructed in $O(n\log^{\varepsilon} n)$ time using $O(n\log\sigma)$ space.
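The functions stored alongside the compressed suffix array map the rank of a suffix to the rank of the suffix that starts $k$ symbols later, that is, $SA^{-1}[SA[i] + k]$. The sketch below computes them naively from an explicit suffix array. The name `psi_k`, the 0-based indexing, and the convention of returning `None` past the end of the text are assumptions made for illustration; the sorting-based suffix array construction is only suitable for small inputs and is not the algorithm of Hon et al.

```python
def suffix_array(text):
    """Naive 0-based suffix array: sort suffix start positions by suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def inverse(sa):
    """Inverse permutation: inv[pos] is the rank of the suffix starting at pos."""
    inv = [0] * len(sa)
    for rank, pos in enumerate(sa):
        inv[pos] = rank
    return inv


def psi_k(sa, inv, i, k):
    """Rank of the suffix starting k symbols after suffix SA[i], or None."""
    pos = sa[i] + k
    return inv[pos] if pos < len(sa) else None
```

Applying the function for $k = 1$ twice gives the same rank as a single application with $k = 2$, which is why storing the functions for a few selected values of $k$ lets a search skip ahead by many symbols at once.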